The Mission of AGARD


According  to  its  Charter,  the  mission  of  AGARD  is  to  bring  together  the  leading  personalities  of  the  NATO  nations  in  the  fields 
of  science  and  technology  relating  to  aerospace  for  the  following  purposes: 

—  Recommending  effective  ways  for  the  member  nations  to  use  their  research  and  development  capabilities  for  the 
common  benefit  of  the  NATO  community; 

—  Providing  scientific  and  technical  advice  and  assistance  to  the  Military  Committee  in  the  field  of  aerospace  research  and 
development  (with  particular  regard  to  its  military  application); 

—  Continuously  stimulating  advances  in  the  aerospace  sciences  relevant  to  strengthening  the  common  defence  posture; 

—  Improving  the  co-operation  among  member  nations  in  aerospace  research  and  development; 

—  Exchange  of  scientific  and  technical  information; 

—  Providing  assistance  to  member  nations  for  the  purpose  of  increasing  their  scientific  and  technical  potential; 

—  Rendering  scientific  and  technical  assistance,  as  requested,  to  other  NATO  bodies  and  to  member  nations  in  connection 
with  research  and  development  problems  in  the  aerospace  field. 

The  highest  authority  within  AGARD  is  the  National  Delegates  Board  consisting  of  officially  appointed  senior  representatives 
from  each  member  nation.  The  mission  of  AGARD  is  carried  out  through  the  Panels  which  are  composed  of  experts  appointed 
by  the  National  Delegates,  the  Consultant  and  Exchange  Programme  and  the  Aerospace  Applications  Studies  Programme.  The 
results  of  AGARD  work  are  reported  to  the  member  nations  and  the  NATO  Authorities  through  the  AGARD  series  of 
publications  of  which  this  is  one. 

Participation  in  AGARD  activities  is  by  invitation  only  and  is  normally  limited  to  citizens  of  the  NATO  nations. 


The  content  of  this  publication  has  been  reproduced 
directly  from  material  supplied  by  AGARD  or  the  authors. 
Theme


In the early 90s the ubiquitous availability of broader communication bands, with ISDN (Integrated Services Digital Network), for example, and
other channels such as electronic mail, the availability of easy-access and user-friendly protocols, the processing of natural languages
with the help of artificial intelligence, and the use of machine-aided translation (computer-assisted or automatic translation) will have
many effects on information retrieval and means of communication.

These  new  concepts  will  result  in  better  fulfilment  of  the  needs  of  users,  who  consider  the  delay  in  information  delivery  as  a 
more  and  more  important  factor. 

Our information systems and centres must be prepared to make full use of these new technologies, which should enable a better
fulfilment of users' needs. The final paper proposed short-term actions to be undertaken and recommendations to be
transmitted to authorities and all persons responsible for information systems and centres.
Mr Chairman, ladies and gentlemen,

It  is  a  great  pleasure  to  welcome  all  of  you  to  Trondheim  and  to  this  AGARD  TIP  specialists’  meeting  on  “Bridging  the 
Communication  Gap”. 

I am very happy to see that so many of you have found the way to Trondheim, the centre of Mid-Norway. Geographically, it is not
in the middle of the country, only about 1/3 of the way up to North Cape, the northernmost part of mainland Norway. If we also
include the island of Spitzbergen, where we have a couple of Norwegian towns, it is an even smaller fraction of the distance up to
the northernmost point. Still, Trondheim is at the same latitude as Fairbanks in Alaska or Godthåb in Greenland.

To illustrate the length of our country: if we could turn Norway around its southern point, North Cape would reach down to
Rome and the frosty Spitzbergen would be placed in the middle of the Sahara desert in Africa. In this long and impractical
country we have only 4 million people.

Historically and culturally Trondheim has been the centre of Norway. It was founded by the Viking King Olav Tryggvason in the
year 994, i.e. 1000 years ago. You will see a monument to the King in the middle of the city. Trondheim was also the capital of
Norway during the Viking Age. It is now the third largest city in Norway.

It  is  not  because  of  its  past  that  we  have  organised  the  TIP  meeting  in  Trondheim.  It  is  the  leading  city  when  it  comes  to  higher 
technical  education  and  research,  with  its  Norwegian  Institute  of  Technology  and  the  SINTEF  research  organisation.  The 
Technical  University  Library  of  Norway  here  in  Trondheim  is  the  leading  library  in  the  country  in  respect  of  the  development 
and exploitation of modern computer based library services. In the early 70s the library started developing its BIBSYS
system, and the latest version of that system has now become a standard for research libraries in Norway.

AGARD’s  main  task  is  to  stimulate  the  exchange  of  scientific  and  technical  information  in  the  aerospace  field  in  order  to  help 
strengthen  the  NATO  defence  posture.  Last  year  the  AGARD  technical  panels,  with  their  450  scientific  and  technical  experts, 
sponsored  more  than  40  conferences,  which  attracted  a  total  attendance  of  about  6000.  The  panels  of  AGARD  organised  60 
working  groups  and  produced  80  publications.  90  support  projects  to  Greece,  Portugal  and  Turkey  were  established. 

How  do  AGARD’s  admirable  work  and  goals  comply  with  the  revolutionary  political  changes  in  Europe  where  Eastern  Europe 
is  liberating  itself  and  the  Soviet  Union  has  embarked  on  the  long  journey  toward  a  free  society?  The  London  Declaration  on  a 
Transformed  North  Atlantic  Alliance  issued  by  the  North  Atlantic  Council  in  July  this  year  opens  up  NATO  as  an  alliance  of 
change.  It  will  continue  to  provide  for  the  common  defence,  but  NATO  should  also  be  an  institution  where  Europeans, 
Canadians  and  Americans  work  together  not  only  for  the  common  defence,  but  to  build  a  new  partnership  with  all  the  nations 
of  Europe.  The  Alliance’s  integrated  force  structure  will  change  fundamentally  to  include  the  following  three  elements: 

—  NATO  will  field  smaller  and  restructured  active  forces.  These  forces  will  be  highly  mobile  and  versatile  to  allow  for 
maximum  flexibility  in  responding  to  a  crisis. 

—  NATO  will  scale  back  the  readiness  of  its  active  units,  reducing  training  requirements  and  the  number  of  exercises. 

—  NATO  will  rely  more  heavily  on  its  ability  to  build  up  larger  forces  if  and  when  they  might  be  needed. 

To  reduce  the  Alliance's  military  requirements  it  is  also  essential  to  have  sound  arms  control  agreements  with  effective  arms 
reduction  verification  regimes. 

Where  do  these  major  transformations  bring  us  with  respect  to  AGARD?  The  AGARD  NDB  has  set  up  a  group  of  Senior 
National  Delegates  to  work  out  possible  scenarios  and  propose  changes.  We  do  not  have  the  report  from  that  group  yet,  but 
some  of  the  more  important  considerations  would  be  the  following. 

—  The expenditures on military defence in the NATO countries will decrease. Some countries have already announced deep
cuts of the order of 20% over the next 5 years.

—  The  restructuring  of  our  forces,  and  higher  mobility,  can  not  be  achieved  through  organisational  measures  alone.  We  have 
to  develop  new  cost-effective  equipment  suitable  for  the  new  tasks. 


—  A  reduction  of  training  requirements  to  allow  lower  readiness  and  fewer  exercises  will  also  require  new  equipment  which  is 
easier  to  operate  and  which  can  be  stored  for  years  without  being  used. 

—  The  ability  to  build  up  larger  forces  requires  the  capability  to  allow  a  sudden  increase  in  military  production. 

All  these  partly  conflicting  requirements  will  require  greater  effectiveness  of  military  R&D  and  production.  To  reach  these  goals 
the  output  from  the  NATO  R&D  community  has  to  increase  without  a  corresponding  increase  in  expenditure.  The  way  ahead 
includes  measures  such  as: 

—  Putting  more  effort  into  the  reduction  and  avoidance  of  unnecessary  duplication  of  R&D  work  within  the  NATO 
countries. 

—  Better  coordination  between  military  and  civilian  R&D.  In  many  areas  this  means  that  military  R&D  should  be  added 
onto  civilian  R&D  programmes. 

—  More  standardisation  of  equipment  and  support  systems. 

A  more  efficient  way  of  managing  the  scientific  and  technical  information  flow  within  the  countries  as  well  as  between  the 
countries  is  a  key  issue  in  order  to  achieve  these  goals  and  may  be  one  of  the  most  challenging  tasks  for  NATO.  When  we  know 
that  over  6000  scientific  articles  are  written  each  day  and  a  doubling  is  expected  over  6  years  we  all  understand  how  demanding 
this  task  is,  namely  to  get  the  right  information  to  the  right  person  at  the  right  time.  Improving  the  management  and 
dissemination  of  technical  information  could  reduce  the  duplication  of  R&D  effort  in  the  Alliance,  improve  the  quality  of  work 
done  and  thereby  reduce  the  total  expenditure  on  defence. 

AGARD  does  not  have  a  direct  responsibility  for  standardisation  in  NATO,  but  to  achieve  the  NATO  goals,  standardisation  is 
mandatory  when  it  comes  to  reducing  the  cost  of  equipment  and  support  systems.  Indirectly,  AGARD  plays  an  important  role 
in  standardisation  through  the  work  in  specialists'  meetings  and  in  the  panels.  Common  understanding  of  the  technological 
possibilities,  common  terminology  and  exchange  of  information  about  ongoing  research  programmes  have  in  the  past  led  to 
common  approaches  to  solve  problems  and  paved  the  way  for  standards. 

Seen  in  the  perspective  outlined,  this  TIP  specialists’  meeting  on  “Bridging  the  Communication  Gap”  is  highly  relevant  and 
timely.  We  now  see  the  technological  possibilities  to  realize  effective  methods  for  handling  the  information  flow.  The  greatest 
difficulty  will  be  to  get  acceptance  for  common  standards  which  can  ensure  simple  and  effective  exchange  of  information  within 
and  between  the  NATO  countries,  and  between  the  civilian  and  the  military  communities. 

AGARD  meetings  make  it  easier  for  people  from  the  different  organisations  and  countries  to  get  together  and  exchange  views 
and  information  at  the  conference  and  in  a  more  informal  manner  outside  the  conference  room.  To  establish  new  friendships 
and  contacts  is  one  of  the  purposes  of  arranging  AGARD  conferences. 

The name Trondheim comes from the old Norwegian “Trondheimr”, which means “The home where you thrive”. I hope that
you all will feel comfortable here and that you enjoy your stay in Trondheim. I wish you a successful meeting.
SUMMARY

This paper deals with optical storage technology and presents the
different members of the analog, hybrid and digital optical storage
groups. So much happens so fast in this area that users are confused
about which media are useful for what.

The different media have both advantages and disadvantages.

Analog storage has up to the present been the only medium for storage
and presentation of pictures and films with high quality (e.g.
LaserVision, LV-ROM).

When pictures are digitized, each picture needs from 0.3 to 1.0 Mbyte
depending on the quality. That means that a CD-ROM can hold only about
700-900 color pictures, or only 25-35 seconds of film presentation.

A videodisk can hold more than 100 000 pictures, which gives about one
hour of film presentation.

The producers involved are working hard to solve this problem by
finding new compression algorithms.

The presentation will deal with the different types: LaserVision, RID,
WORM, DRAW, CD-ROM, CD-ROM-XA, DVI, CD-I, Erasables etc.

It covers storage capacity, which type of material is useful to store
on which media, existing standards, use and miscellaneous matters,
future prospects etc. The main emphasis is on CD-ROM, which is the most
commonly used optical medium in a library environment.


INTRODUCTION

A PC can handle electronically stored data. This has given publishers
and information intermediaries the possibility to distribute their
products electronically.

Since computer technology entered our working environment, magnetic
storage has been the dominant method of storing data, but in recent
years this medium has faced competition from new storage technologies
such as optical storage devices, owing to these media's excellent
durability and enormous storage capacity.

Optical storage makes it possible to store text, pictures, sound and
data at relatively low cost, and with fast access to the information.

The medium is small and light, and gives the user access to enormous
quantities of information.


HISTORY

By the late 1920s, encoded optical storage was being used for the
recording of motion picture soundtracks. In the recording process, the
sound is converted into an electrical signal which modulates a light
source that exposes the soundtrack portion of the film. This produces a
squiggly line on the film. This method of storing sound information is
analog optical storage and is still used for motion picture
soundtracks.

In the 1940s and 1950s, the process of magnetic recording was developed
and came into widespread use. This technique involves storing
information as a magnetic pattern on a moving magnetic surface: a
time-varying electric current passes through a magnetic core, producing
a corresponding time-varying magnetic field.

The digital information used by computers is stored as patterns of 1s
and 0s, or ONs and OFFs. The encoded data does not suffer from the
degradation that can affect analog information. Unlike an analog
recording, digitally stored information is made up of only two signal
levels: ON or OFF. A copy is not merely an approximation of the digital
source, but an exact bit-for-bit reproduction.
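
The difference between copying an analog and a digital recording can be illustrated with a small simulation. This is only a sketch under stated assumptions: the noise level, the 0.5 decision threshold and all function names are hypothetical choices made for illustration and do not describe any particular recording system.

```python
import random


def analog_copy(signal, noise_sigma, rng):
    """Copy an analog signal; every copy adds a little random noise."""
    return [level + rng.gauss(0.0, noise_sigma) for level in signal]


def digital_copy(bits, noise_sigma, rng):
    """Copy a bit pattern. The same noise is added to the two signal
    levels (0.0 = OFF, 1.0 = ON), but the reader then decides ON or OFF
    against a threshold, so small disturbances are discarded."""
    noisy = [bit + rng.gauss(0.0, noise_sigma) for bit in bits]
    return [1 if level > 0.5 else 0 for level in noisy]


def copy_generations(copy_fn, original, generations, noise_sigma=0.02, seed=0):
    """Make a copy of a copy of a copy, `generations` times in a row."""
    rng = random.Random(seed)
    current = list(original)
    for _ in range(generations):
        current = copy_fn(current, noise_sigma, rng)
    return current


if __name__ == "__main__":
    source = [1, 0, 1, 1, 0, 0, 1, 0]
    analog = copy_generations(analog_copy, source, generations=50)
    digital = copy_generations(digital_copy, source, generations=50)
    worst = max(abs(a - s) for a, s in zip(analog, source))
    print("largest analog error after 50 copies:", round(worst, 3))
    print("digital copy identical to source:", digital == source)
```

In the analog case the small errors accumulate from one generation to the next, while the digital copy is restored to clean ON/OFF levels at every step as long as the noise stays well below the threshold.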


ANALOG AND DIGITAL OPTICAL STORAGE

14 years ago the Dutch company Philips demonstrated its first
videodisk, VLP (Video Long Play), and in 1981 Philips put the disk on
the market under the better-known name LaserVision. Since then the
development in this area has moved forward very fast.

At the moment there are about 100 different members of the optical
storage family. Most of them are unknown and have appeared on the
market in the last 2-3 years, and some of them are still in the lab.

With so many different media to choose between, it can be difficult to
choose the right type for a particular purpose.

Only a few of all the different optical storage devices are in real
commercial use today.

The optical family can be divided into two main groups, the reflective
and the transmissive group. This presentation will deal with the
reflective group.


THE TECHNOLOGY

There are some differences between the different disks, but they have
some common characteristics. The disk, which is mainly produced from
plastic, is 1-2 mm thick and consists of different layers. The
innermost layer has a concentric track whose length depends on the disk
type. This track consists of microscopic pits of different size and
spacing.

On analog disks the size of the pits and the distance between them
depend on the information to be stored. In reality the information on
an analog disk is stored in a digital manner, but the reflected laser
beam acting as the information carrier is, after passing through the
photo decoder, decoded to an analog electrical signal.

Typical for the analog signal is that it is a time-varying signal.

Digital storage of information on the archival disks (>20 cm diameter)
may be represented as holes and 'nonholes'. Holes are 0s or OFFs, and
'nonholes' are 1s or ONs. The size of the holes is exact and the
distance between the holes and 'nonholes' is the same all over the
disk. The reflected laser beam represents a continuous stream of ONs
and OFFs, which after the photo decoder becomes an electrical signal
that is then decoded into ONs and OFFs by a highly technical and
complicated process. Since the computer works with digital signals (ONs
and OFFs), we do not need any mechanism for converting the signals from
digital to analog.

To store digital information as effectively as possible, the digital
compact disks were developed. They are single-sided disks, and the
binary numbers 1s and 0s are not represented by the low-reflecting
holes and the high-reflecting 'nonholes'.

It is the change between pit/land or land/pit that gives the binary
value 1. The pits and the distance between the pits vary and give the
respective 0s (OFFs).
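
The transition rule can be sketched in a few lines of code. The representation used here (a string with one 'P' or 'L' character per clock period) and the function names are assumptions made for illustration. Real compact disks additionally apply a channel code (EFM) on top of this rule, which the sketch leaves out.

```python
def decode_transitions(surface):
    """Decode a track given as a string of 'P' (pit) and 'L' (land),
    one character per clock period. A change between neighbouring
    periods gives a 1; no change gives a 0."""
    return [1 if prev != cur else 0 for prev, cur in zip(surface, surface[1:])]


def encode_transitions(bits, start="L"):
    """Inverse operation: build a pit/land string in which every 1
    toggles between pit and land and every 0 extends the current one."""
    surface = [start]
    for bit in bits:
        last = surface[-1]
        if bit == 1:
            surface.append("P" if last == "L" else "L")
        else:
            surface.append(last)
    return "".join(surface)


if __name__ == "__main__":
    track = "LLLPPPPLLL"
    bits = decode_transitions(track)
    print(track, "->", bits)
    print(bits, "->", encode_transitions(bits))
```

Because only the edges carry the 1s, the length of a pit or of a land determines how many 0s lie between two 1s, which is why pits of varying length appear on the disk.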

The information stored on the compact disks is therefore three times
more closely packed than on the big archival disks.

The next layer is an extremely thin layer of polished aluminium which
follows the pits of the innermost layer and acts as the reflective part
of the disk. The outermost layer is a protective plastic coating of
polycarbonate. Fig. 1 shows the cross-section of a compact disk.


Fig. 1 Compact Disk Cross-Section


An intense and accurate laser beam in the optical reader hits the
aluminium foil of the disk and is reflected back to a photocell where
the signals are converted to electrical signals.

Because of the different size and spacing of the pits, the intensity
of the reflected laser beam varies over time, and the photocell
receives different signals (Fig. 1).


Constant  Angular  Velocity  (CAV)  and 
Constant  Linear  Velocity  (CLV) 

CLV 

Before we start presenting the main members of the optical family, we
have to explain two terms. The information can be stored at the same
density all over the disk. That means the disk reader has to change the
rotation speed depending on where the laser head is reading information
on the disk, so that the disk rotates with a constant linear velocity
(CLV).

CAV

When the information is stored with higher density near the centre than
at the periphery of the disk, the disk will rotate at the same speed
regardless of where the laser head is reading information. We say that
the disk is rotating with a constant angular velocity (CAV).
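
The relationship between the two modes can be written as v = 2πr · (rpm / 60), where v is the speed of the track under the laser head and r the radius being read. The sketch below applies this formula in both directions. The linear velocity of 1.3 m/s, the 500 rpm and the radii are assumed values chosen only for illustration; they are not figures from this paper.

```python
import math


def rpm_for_clv(linear_velocity_m_s, radius_m):
    """Rotation speed a CLV reader needs at a given radius so that the
    track passes the laser head at the requested linear velocity."""
    return linear_velocity_m_s / (2 * math.pi * radius_m) * 60


def linear_velocity_for_cav(rpm, radius_m):
    """Speed of the track under the laser head on a CAV disk, which
    rotates at the same rpm whatever the radius."""
    return 2 * math.pi * radius_m * rpm / 60


if __name__ == "__main__":
    # Assumed, illustrative values: 1.3 m/s for CLV, 500 rpm for CAV.
    for radius_mm in (25, 40, 58):
        r = radius_mm / 1000
        print(
            f"radius {radius_mm} mm: CLV needs {rpm_for_clv(1.3, r):.0f} rpm, "
            f"CAV at 500 rpm gives {linear_velocity_for_cav(500, r):.2f} m/s"
        )
```

On a CLV disk the rotation slows down as the head moves outwards, whereas on a CAV disk the track simply moves faster under the head at the periphery, so the same amount of data occupies a longer stretch of track there.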




Fig.  2  Sector  structure  on  CAV  and  CLV  Disks 


THE  REFLECTIVE  OPTICAL  FAMILY 

The reflective optical family consists of three main groups: the
analog, the hybrid and the digital groups, and each group contains many
members (Fig. 3).


THREE  STANDARDS 

PAL (Phase Alternation Line) is the European TV standard giving 25
pictures per second.

NTSC (National Television System Committee) is the American standard
giving 30 pictures per second.

SECAM (Séquentiel couleur à mémoire) is the French standard and is also
used in Eastern Europe.

A PAL disk cannot be used on an NTSC reader and vice versa, but today
so-called dual players or combi players are on the market which can
handle more than just one standard.
CD-DA      : Compact Disc - Digital Audio
CD-ROM     : Compact Disc - Read Only Memory
CD-ROM-XA  : CD-ROM Extended Architecture
WORM       : Write Once Read Many
DRAW       : Direct Read After Write
DVI        : Digital Video Interactive
CD-I       : Compact Disc Interactive
ERASABLE   : Erasable Disc
LV-ROM     : LaserVision - Read Only Memory
CD-IV      : Compact Disc - Interactive Video
CD-V       : Compact Disc - Video
LV         : LaserVision
RID        : Recordable Interactive Disc

Fig. 3 Some of the members of the Optical Storage Family


THE  ANALOG  DISKS 
LaserVision

The videodisk, better known as LaserVision, was introduced on the
market in 1981 by Philips and succeeded the VLP (Video Long Play)
disk.

This disk is well suited for storing pictures, movies (video) and sound
at high quality. The disk, with a diameter of 30 cm, contains 54 000
tracks on each side for storing pictures, and two soundtracks for
stereo or bilingual sound.

Each track can be accessed randomly, which makes it possible to access
an exact picture in a very short time. The entire disk can be scanned
in 25 seconds.

This is an advantage compared with videotapes, where we need much more
time to find an exact picture sequence.

The videodisk is an advanced extension (version) of the videotape and
was originally developed for the entertainment industry. The disks
exist as both CAV and CLV disks.


LaserVision-CAV

On the CAV disk one picture is stored on each track, each one with a
unique address. The disk rotates with a constant angular velocity. A
disk in PAL format can hold 36 minutes of video or 54 000 single
pictures with stereo sound on each side. Because of the unique
addressing of each picture, the disk is well suited for presenting
single pictures (slides) or short film sequences.

The disk can be controlled from a remote controller or a PC/computer
and is therefore an interactive system, but the data on the disk cannot
be processed by a computer without special applications.

The disk is produced for all three standards: PAL, NTSC and SECAM.

The primary market for the disk is schools, education, training,
advertising etc. Under normal conditions, the disk rotates at 1 500
rpm.
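
The figures quoted for the CAV disk are consistent with each other, as a short calculation shows. The sketch below only restates the numbers given in the text (54 000 pictures per side, 25 pictures per second for PAL, 1 500 rpm); the function names are hypothetical.

```python
def playing_time_minutes(frames, frames_per_second):
    """Playing time of a side that holds `frames` single pictures."""
    return frames / frames_per_second / 60


def frames_per_revolution(rpm, frames_per_second):
    """Number of pictures read during one revolution of the disk."""
    return frames_per_second / (rpm / 60)


if __name__ == "__main__":
    print("PAL playing time per side:", playing_time_minutes(54_000, 25), "min")
    print("pictures per revolution at 1 500 rpm:", frames_per_revolution(1_500, 25))
```

At 1 500 rpm the disk makes 25 revolutions per second, so exactly one PAL picture is read per revolution. This is what allows a single track to be repeated as a still picture.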


LaserVision-CLV 

The disk exists for all three standards (PAL, NTSC and SECAM) and is
quite similar to the CAV disk, except that the disk rotates at constant
linear velocity and can therefore hold 2x60 min of video with stereo
sound.

Under normal conditions in PAL format the rotational speed varies from
1 500 to 570 revolutions per minute.
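
Since the rotation speed of a CLV disk is inversely proportional to the radius being read, the quoted speed range implies the ratio between the outermost and innermost recorded radius. The following is a minimal sketch of that reasoning. The inner radius of 55 mm is a hypothetical value used only to show the order of magnitude and is not taken from this paper.

```python
def radius_ratio_from_rpm(rpm_inner, rpm_outer):
    """Under CLV, rpm is inversely proportional to the radius, so the
    outer/inner radius ratio equals rpm_inner / rpm_outer."""
    return rpm_inner / rpm_outer


def outer_radius_mm(inner_radius_mm, rpm_inner, rpm_outer):
    """Outermost recorded radius implied by an assumed inner radius."""
    return inner_radius_mm * radius_ratio_from_rpm(rpm_inner, rpm_outer)


if __name__ == "__main__":
    ratio = radius_ratio_from_rpm(1_500, 570)
    print(f"outer/inner radius ratio: {ratio:.2f}")
    # Hypothetical inner radius of 55 mm, for illustration only.
    print(f"implied outer radius: {outer_radius_mm(55, 1_500, 570):.1f} mm")
```

With the assumed inner radius the implied outer radius stays just inside the 150 mm radius of a 30 cm disk.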

The disk is suitable for storing movies and is aimed at the
entertainment and consumer market, but the disk appeared a little too
late on the market, since the videocassette was already established
with a very large market share.


LV-CAV-P  and  LV-CAA 

Two other products in the LaserVision family are the LV-CAV-P disk,
where P stands for Professional, and the LV-CAA disk, where CAA stands
for Constant Angular Acceleration.

The first one may contain 36 min of video or 54 000 single pictures,
stereo sound and 350 Mb of information on each side, and the second
2x2 700 single pictures and 2x22.5 min of video with stereo sound.

The disks are used for automatic information and sales applications
(point of sale and point of information).

The disks are also used for presentations on multivision or videowalls
with 16, 24 or 30 TVs placed together. All the screens can present the
same picture, parts of a larger picture or pictures in different
configurations.

The necessary equipment is a computer to distribute the different
pictures to the different screens.

LV-CAV-P appeared on the market in 1981 and LV-CAA in 1987.


RID-CLV  and  RID-CAV 

RID is the acronym for Recordable Interactive Disk, and both the CAV
and CLV disks appeared on the market in 1987. The CLV disk can hold
2x60 minutes of video with stereo sound and the CAV disk 2x54 000
single pictures or 2x30 minutes of video with stereo sound. Both disks
are for local recording of video and the CLV disk also for local
recording of still pictures.

They are usable for local recording in connection with marketing, sales
and education.

Recordable videodisks open up great possibilities since the quality is
much better than for videotapes.

The CAV disk is also suitable for multivision or videowalls.
SOME COMMENTS

The videodisks were developed for the storing and playing of video
material such as movies and audiovisual information for education,
entertainment, marketing and sales.

The disks can also contain digital data.

The digital information is then encoded on a video signal and stored on
the disk in the same manner as an analog signal, but there are some
difficulties since the pits are very small (1 micrometer) and sensitive
to dust particles and irregularities on the disk.

Since the pictures on a videodisk are stored as analog signals they
cannot be processed or manipulated by a computer without special
equipment. Today we can install an interface card in our PC (e.g. MIC
4000), which makes it possible to convert analog stored pictures on the
videodisk into digital data for PC use. But the storage requirement for
an analog picture in good TV quality is from 1/2 to 1 Mb in digitized
form. For a 600 Mb compact disk that means only 30-35 seconds of video
(PAL).
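
The order of magnitude of this limit can be checked with a short calculation. The sketch below uses only the figures given in the text (600 Mb capacity, 1/2 to 1 Mb per picture, 25 pictures per second for PAL) and assumes that the pictures are stored uncompressed and that nothing else occupies the disk. The function names are hypothetical.

```python
def video_seconds(capacity_mb, mb_per_picture, pictures_per_second=25):
    """Seconds of uncompressed video that fit on a disk."""
    return capacity_mb / mb_per_picture / pictures_per_second


def mb_per_picture_for(capacity_mb, seconds, pictures_per_second=25):
    """Picture size that would exactly fill the disk in `seconds`."""
    return capacity_mb / (seconds * pictures_per_second)


if __name__ == "__main__":
    print("at 1.0 Mb per picture:", video_seconds(600, 1.0), "s")
    print("at 0.5 Mb per picture:", video_seconds(600, 0.5), "s")
    print("picture size giving 30 s:", mb_per_picture_for(600, 30), "Mb")
```

The two extremes bracket the 30-35 seconds quoted above, which would correspond to a picture size somewhat below 1 Mb.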


THE HYBRIDS

This medium is based on analog storing of pictures or video while the
sound or text is digitized. Hence the name hybrid: it is a cross-breed,
since it has the possibilities for both analog and digital storing.


CD-V - Compact Disk - Video

The disk uses analog video and digital sound. CD-V is based on the
standard described in THE BLUE BOOK (Philips) and was introduced to the
market in 1988. The standard defines three different disk sizes: 12 cm
(CD-V-S, Compact Disk Video Single), 20 cm (CD-V-EP, Compact Disk Video
Extended Play) and 30 cm (CD-V-LP, Compact Disk Video Long Play).

All the different sizes can be played on so-called 'combi players'.

The use of analog video means that we have to take into consideration
the three standards PAL, NTSC and SECAM.

The 12 cm CD-V has a golden surface to distinguish it from its more
famous relative CD-DA (Compact Disk - Digital Audio) of the same size.

Each disk can contain 9 000 single pictures or 6 min of video (PAL) and
20 min of stereo sound (HiFi).

The EP and LP disks are double-sided and hold 20 and 30 min of stereo
sound (HiFi) on each side.

The disks are aimed at the consumer market.

Combi players can handle CD-DA (digital music), CD-V-S, CD-V-EP,
CD-V-LP (digital music + TV) and LV-CLV (analog music + TV).


LV-ROM - LaserVision Read Only Memory

The disk is for interactive use. It was introduced on the European
market in late 1986 and holds about 100 000 pictures (PAL), sound and
data. The pictures are stored in analog form while the sound and text
are digitized. The disk is useful for educational and instructional
purposes, and the users come from schools, universities and private
companies.

One of the better-known projects where LV-ROM is used is THE DOMESDAY
PROJECT. The Domesday Project commemorates the 900th anniversary of the
Domesday Book, compiled on the orders of King William I. However,
instead of parchment, the 1986 version uses LV-ROM technology, which
incorporates digital data on a LaserVision disk. The database for the
project is contained on two disks, a national and a community one.


CD-IV - Compact Disk Interactive Video

A 12 cm disk contains 600 Mb of data. It is a combination of CD-I and
CD-V technology and can store entertainment, facts and educational
information.

CD-IV is a stand-alone system with its own CPU, which is hooked up to a
TV and controlled with a remote control.

The launch is planned for 1990/91.


THE DIGITAL SYSTEM
History

Philips was the pioneer in the area of digital optical storage. They
started experiments in the early 70s. These early experiments led to
the Megadoc system, which was introduced in 1978 and uses a 30 cm
optical disk that contains 3 Mb of data. Information such as text,
pictures or sound is converted to digital form and stored as binary
codes (pits and lands) on the highly reflective optical disk.


CD-DA Compact Disk Digital Audio
The Red Book

Although the laserdisk did not prove to be a success in the consumer
market, it led directly to the development of the digital audio
compact disk, the largest success to date within consumer
electronics.

Philips and Sony are competitors, both working hard in the optical
storage media area. But they do cooperate through an agreement to
share their optical disk technology in order to develop a standard
for optical recording of digitally encoded music, specified in The
Red Book of June 1980.

In late 1982 they began to market audio compact disk players. The
standard describes how the data (music) should be organized on the
disk (CD-DA). It is the same concept as applied for the CD-ROM,
except that the CD-ROM makes much higher demands on error correction
and coding of the data.

The CD-DA players were an almost immediate success and the consumers
eagerly accepted the medium because of its overwhelming advantages.

Some figures show this explosion: 35 000 players sold in 1983,
200 000 in 1984, 800 000 in 1985 and about 50 million in the USA in
1990.

The 12 cm CD-DA disk can hold 72 min of HiFi sound and 32 Mb of
data, text and pictures, and the disks are single-sided.

The CD Audio disk carries an exact, digital copy of the master
recording and has an extremely high sound quality.


WORM - Write Once Read Many (Times)

On this medium we can store the information ourselves, but
information already stored cannot be changed or deleted. Stored
information can be read as many times as you want.

Most commercial WORM disks have one or more vacuum-deposited metal
layers where the information is stored. The information is written
into the metal layer by a laser beam, either as holes, as bubbles or
by melting some of the layers together locally.

The disk is very stable even after long storage, and the durability
does not decrease even after frequent use. The first-generation WORM
disks were by-products of the videodisks. The 12" disks became
commercially available in 1983 and could hold 1 000 Mb (= 1 Gb) of
information on each side.

An example of the WORM disk is the Megadoc system from Philips. On
the Megadoc disk we can store 1 Gb of information on each side. This
can be illustrated by 500 000 typewritten pages of text, equal to
one billion keystrokes or about 170 shelf metres.

The second-generation WORM readers became available in 1985. They
are smaller and cheaper than the first-generation readers and use
the 5 1/4" disk size. The readers are available as external and
internal models.

All types of information can be stored on the disk. Examples are
archiving of documents (patient journals, contracts, letters, bills,
insurance agreements, banking information etc.), drawings, photos,
tables, statistics etc. Each document gets its own unique address on
the disk and can easily be retrieved.


DRAW - Direct Read After Write

This disk, which is also named E-DRAW (Erasable Direct Read After
Write), is quite similar to the WORM disk and is used for storing
the same type of information, but differs in some ways. If we find
something wrong with information already stored, these parts of the
disk can be marked as unusable and the correct information stored.
The laser will ignore the marked areas on the disk. The disks, which
were introduced in 1984, exist as both 5 1/4" and 12" with storage
capacities from 300 to 1 000 Mb on each side.
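
The behaviour described for WORM and DRAW disks can be pictured with a small model. The following is only a sketch under stated assumptions: the class name, the sector layout and the rule that a corrected document goes into the next blank sector are all hypothetical and are not taken from any drive specification.

```python
class WriteOnceDisk:
    """Toy model of a write-once disk with DRAW-style correction (hypothetical)."""

    def __init__(self, sector_count):
        self.sectors = [None] * sector_count  # None means the sector is still blank
        self.unusable = set()                 # addresses flagged as not to be read
        self.index = {}                       # document name -> current sector address
        self.next_free = 0

    def store(self, name, data):
        """Write data into the next blank sector; written sectors are never rewritten."""
        if self.next_free >= len(self.sectors):
            raise IOError("disk full")
        address = self.next_free
        self.sectors[address] = data
        self.next_free += 1
        self.index[name] = address
        return address

    def correct(self, name, data):
        """Flag the old sector as unusable and store the corrected version elsewhere."""
        self.unusable.add(self.index[name])
        return self.store(name, data)

    def read(self, name):
        return self.sectors[self.index[name]]

    def readable_sectors(self):
        """Addresses a reader would visit, skipping the flagged sectors."""
        return [a for a in range(self.next_free) if a not in self.unusable]


disk = WriteOnceDisk(sector_count=8)
disk.store("contract", "pay 100")
disk.store("letter", "hello")
disk.correct("contract", "pay 1000")
```

Note that the faulty data is never erased in this model: it stays in its sector and is merely skipped, which is the difference from a truly erasable disk.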


CD-ROM - Compact Disk Read Only Memory

The High Sierra Group Format or The Yellow Book

As an extension to the CD-DA specification in The Red Book, Philips
and Sony announced a new specification for data storage in October
1983.

In November 1985 companies working in the optical media area came
together near Lake Tahoe in the High Sierra to agree on a standard
for the new optical medium CD-ROM.

ROM is the acronym for Read Only Memory and means that the disk may
only be read, not written on, by the user.

Representatives from Apple Computer, Hitachi, Philips, Microsoft
etc. named themselves the High Sierra Group, after their first
meeting place.

Their common agreement became the specifications in The Yellow Book.
The de facto standard later became ISO 9660.


The CD-ROM

A CD-ROM is a 12 cm diameter CLV disk which can hold more than
550 Mb of data, corresponding to 275 000 typewritten A4 pages or
1 600 floppy disks. The information is stored in a spiral track
about five km long on the disk.
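
The capacity comparison above can be checked with a few lines of arithmetic. This is a minimal sketch, assuming that 1 Mb in the text means one million bytes; the function name is chosen here for illustration only.

```python
MB = 1_000_000  # assumption: 1 Mb in the text means 10**6 bytes


def bytes_per_unit(capacity_mb, units):
    """Average number of bytes available per page or per floppy disk."""
    return capacity_mb * MB / units


chars_per_page = bytes_per_unit(550, 275_000)   # bytes per typewritten A4 page
bytes_per_floppy = bytes_per_unit(550, 1_600)   # bytes per floppy disk
```

Under this assumption a page corresponds to 2 000 characters, and each floppy disk to roughly 340 000 bytes, which is in the range of the common low-density floppy formats of the time.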

The CD-ROM has a digital output signal and does not need any
digital/analog converter.

CD-ROM was originally developed for storing textual and graphical
information, not sound like CD-DA, and therefore it needs a much
better error correction system for the encoded data than CD-DA.

Besides CD-DA, CD-ROM is the only digital optical storage medium
that has become a hit, and its users are found especially within the
library world.

The CD-ROM is aimed at the library, school, agency and business
markets.

Since the first CD-ROM databases were bibliographical or reference
databases having a corresponding online database or a printed
abstract journal, it was natural that the libraries were the first
and are still the largest users of the new medium.

Compared with magnetic disk storage, the advantages of CD-ROM are
quite obvious. Because there is no contact between the laser head
and the disk, the CD-ROM disks do not suffer from head crashes or
physical wear.

The medium itself is extremely durable and difficult to damage, and
there is as yet nothing to suggest that the disks have a limited
life-span.

CD-ROM is appropriate for the distribution of large amounts of data.
This could be electronic publishing of databases, maps, library
catalogs, abstract journals, books, magazines and periodicals,
market and financial information, parts lists, repair manuals,
historical data and other material that is typically distributed
through online services or on paper.


CD-I - Compact Disk Interactive
The Green Book

CD-I is, like CD-DA and CD-ROM, a result of the cooperation between
Philips and Sony, and in February 1986 they announced the standard
for CD-I.

The specification, referred to as The Green Book, describes methods
for providing audio, video, graphics and text.

The specification for CD-I is as strict as for the CD-DA concept and
is intended to provide a world standard for complete cross-brand
compatibility of players and software.

When I visited Philips in August 1989 they were ready to introduce
the CD-I equipment to the market.

The 12 cm CD-I disk can hold 650 Mb of information. This can be used
for 16 channels of speech in mono, 8 channels of speech in stereo,
8 channels of music, 4 channels of music in stereo, 4 channels of
HiFi music, 2 channels of HiFi music in stereo or 1 channel of CD-DA
audio in stereo. Each channel can give 70 minutes of playing time;
that gives a total of about 19 hours of mono speech.
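
The playing times follow directly from the number of channels. The sketch below simply multiplies the channel counts listed above by 70 minutes; the dictionary labels are informal names chosen here, not terms from The Green Book.

```python
MINUTES_PER_CHANNEL = 70

# Channel counts as listed in the text; the labels are informal.
audio_modes = {
    "CD-DA audio, stereo": 1,
    "HiFi music, stereo": 2,
    "HiFi music, mono": 4,
    "music, stereo": 4,
    "music, mono": 8,
    "speech, stereo": 8,
    "speech, mono": 16,
}


def total_hours(channels):
    """Total playing time in hours when every channel is used in turn."""
    return channels * MINUTES_PER_CHANNEL / 60


playing_hours = {mode: total_hours(n) for mode, n in audio_modes.items()}
```

For 16 mono speech channels this gives 1 120 minutes, which is the "about 19 hours" quoted above.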

Depending on the quality, the number of pictures stored on one disk
ranges from 1 000 pictures in high quality with the full color range
(256 colors) to 20 000 pictures with simpler graphics and fewer
colors.

One of the main disadvantages of CD-I is that it uses its own CPU
and is a stand-alone unit which cannot be connected to a PC. Like a
CD-DA player, however, the CD-I unit can be connected to a TV set,
video player or stereo equipment.

Potential CD-I applications are entertainment, education, training,
in-car navigation, home shopping and reference tools.

The CD-I concept has not yet solved the problems of full-screen,
full-motion video (motion pictures) presentation, but full-screen
frame-based animation is possible.


CD-ROM XA - Compact Disk Read Only Memory Extended Architecture

CD-ROM XA is a 12 cm disk like the CD-ROM and a result of the
cooperation between Philips, Sony and Microsoft. The first release
of CD-ROM XA was announced in March 1989. The CD-ROM XA
specification makes it possible for an application to play audio
information at the same time as it retrieves other interleaved
information such as text, images and graphics from the CD-ROM disk.

With CD-ROM XA, audio signals are compressed to take up less space
on a disk. By using different levels of compression, one disk may
contain up to 14 hours of mono quality sound. But even when using
stereo FM quality audio, one disk can still contain 4 hours of
sound.

The compression technology, which is defined in the existing CD-I
specification (The Green Book), is called ADPCM (Adaptive
Differential Pulse Code Modulation) and allows disks to be coded
with compressed digital audio information in an interleaved fashion.

Interleaving is an important part of the standard, and allows for up
to 16 separate data tracks to exist in each sector. In this way the
application can switch languages (audio and/or text) instantly by
simply reading different relative blocks in the same interleaved
file.
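
The idea of switching languages by reading different relative blocks can be sketched as follows. This is an illustration only, assuming a simple round-robin layout of equally long streams; the real sector format of CD-ROM XA is more involved, and the function names are hypothetical.

```python
MAX_CHANNELS = 16  # upper limit mentioned in the text


def interleave(streams):
    """Merge equally long lists of blocks round-robin into one interleaved file."""
    if not 1 <= len(streams) <= MAX_CHANNELS:
        raise ValueError("between 1 and 16 channels expected")
    if len({len(stream) for stream in streams}) != 1:
        raise ValueError("all streams must have the same number of blocks")
    return [block for group in zip(*streams) for block in group]


def read_channel(blocks, channel, channel_count):
    """Pick every channel_count-th block, starting at the channel's offset."""
    return blocks[channel::channel_count]


english = ["en0", "en1", "en2"]
french = ["fr0", "fr1", "fr2"]
blocks = interleave([english, french])
```

Switching from English to French then only means changing the offset from 0 to 1; the reader stays at the same position in the same file.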

The timetable for the detailed specification is as follows:

Step 1: Text and sound (CD-ROM + ADPCM), March 1989

Step 2: Single presentation of images (VGA, CD-I), end of 1989

Step 3: Full-screen presentation of film, 1991

The new CD-ROM XA format will allow publishers to create disks that
will be playable not only on any suitably equipped PC but also on
any CD-I system.

Most of the criticism of CD-I is related to the fact that it is
based on a specific operating system and hardware environment and
hence would be frozen in time and inflexible in the future. CD-ROM
XA would appear to get around this problem for those publishers and
users who want the capabilities of CD-I but in a general PC
environment rather than in a consumer-hardware-specific environment.

CD-ROM XA is a bridge between CD-ROM and CD-I and is aimed mainly at
professional and institutional markets and reference-type
applications. The CD-ROM XA system is basically transparent for any
PC and independent of a particular CPU and operating system.

CD-ROM XA is just an ordinary CD-I for the PC world.


DVI - Digital Video Interactive

The DVI concept was developed at the David Sarnoff Laboratories of
RCA in Princeton, New Jersey. DVI is a cooperation between IBM,
Microsoft and Intel. The prototype has existed since March 1987 and
integrates graphics, audio and video images with text.

The DVI protocol involves processing about 22 Mb, or 12.5 million
calculations, per second of text data, sound and digitally encoded
video images. At this rate of processing, a 550 Mb optical disk
would hold less than a minute of data.

That is why the protocol involves the introduction of new technology
to compress data by a factor of 150 to 1 and to decompress it so
fast that a 30-frame-per-second (NTSC) image can be transferred from
the disk to a video monitor in real time. This makes it possible to
get about an hour of video on a small disk.
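
The two playing times quoted above follow from the data rate and the compression factor. The sketch below only repeats that arithmetic, assuming that the capacity and the data rate are expressed in the same unit (Mb); the function name is chosen here for illustration.

```python
RAW_RATE_MB_PER_S = 22     # uncompressed data rate from the text
DISK_CAPACITY_MB = 550     # capacity of the optical disk from the text
COMPRESSION_FACTOR = 150   # compression ratio from the text


def playing_time_seconds(capacity_mb, rate_mb_per_s, compression=1):
    """Seconds of material that fit on the disk at the given compression."""
    return capacity_mb * compression / rate_mb_per_s


raw_seconds = playing_time_seconds(DISK_CAPACITY_MB, RAW_RATE_MB_PER_S)
compressed_minutes = playing_time_seconds(
    DISK_CAPACITY_MB, RAW_RATE_MB_PER_S, COMPRESSION_FACTOR
) / 60
```

Without compression the disk holds 25 seconds of material; with a factor of 150 it holds 62.5 minutes, in line with the "about an hour" stated above.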

In a typical application one of these disks might combine 20 minutes
of video with 15 000 pages of text, 5 000 high-resolution still
images and six hours of audio.

The DVI disk is 12 cm in diameter and the equipment is to be
connected to a PC.

The DVI concept is built around two microchips placed in the PC. As
mentioned above, the pictures are compressed by a factor of 150. For
the compression process two video display processor (VDP) chips with
VLSI (very large scale integration) architecture are used. One chip
handles the compression algorithm and permits not only single images
but also motion video/film, text and graphics in the same image. The
other chip handles the resolution of the image, the screen size and
the interface to the PC. The concept is based on storing the changes
in the image.

The producers hope that DVI will replace films, video cassettes,
videodisks and different compact disks.


ERASABLE DISKS

The 12 cm erasable or WARM (Write And Read Many (Times)) disks of
today are based on magneto-optical storage of the information, where
a thin magnetic layer covers the reflective layer of the disk. The
needles in the magnetic layer are placed horizontally in the plane
of the disk.

An intense laser beam heats a part of the disk while a strong
electric coil changes the magnetism in that part.

The heating continues until the Curie point or Curie temperature is
reached. The Curie point is the temperature at which a ferromagnetic
material, on heating, loses its ferromagnetism and becomes
paramagnetic. This happens at 760 °C for iron.

The whole concept is based on the Kerr effect, which says that the
change in the polarization of light (laser) reflected from a
ferromagnetic material (iron, cobalt, nickel) depends on the
magnetic state of the material.

To write information on the disk, a special laser reverses the
magnetic direction of each predetermined spot on the disk surface.
As the polarity is switched, the differences are read as 0s and 1s,
the bits of information. A low-intensity laser is used to read the
information on the disk without changing the polarity of the spots.

The disks can hold from 300 to 650 Mb of data. Erasables, just like
WORM systems, suffer from a lack of standards for the interfaces,
the size of disks and the drive system, and from a lack of drivers,
or software for interfacing the optical drives with microcomputer
workstations.

Another problem for WORM and erasables has been the slower access
time and data transfer rates when compared to Winchester hard drive
systems.

Today's erasables have 60-70 ms access time and 5 Mbits per second
data transfer. With increased production and sales, the cost of
drives and disks will decrease.


OTHER SYSTEMS

In one system the information can be stored on a small "credit" card
which is read by a laser.

In another system the information is stored on a paper band which is
read by a light beam. Paper-stored information needs much more room
than the other systems. The bar codes used on commodities in the
stores contain 13 digits describing where the commodity is produced,
who the producer is and the name and price of the commodity.

In some systems the information is stored on tape cassettes. A laser
reads the information on the tape. One of the disadvantages of the
tape medium is that it takes more time to retrieve the information
than on a disk.


THE  FUTURE 

This presentation has dealt with just a few, but mainly the most
important, members of the optical media family, and I have focused
on the members of the reflective part. The number of members in this
family has increased rapidly in the last few years. In most of the
systems the information is stored on reflective disks as pits and
lands, but other optical storage methods exist.

Many of the products have been introduced to the market with
enthusiasm and go-ahead spirit, but since then nothing has been
heard of them. They are here today and gone tomorrow. Some of the
products have never reached the market, but have ended their days in
the lab.

The big industries in the optical field have large development costs
and hope to make the big score.

The problem the industry has to face is that the technology is
suffering from a lack of standards.

Analog storage for film presentation will exist until digital
technology can handle the same material. Today's compression
algorithms are not good enough compared with analog storage.

CD-DA is the only product that has become a hit, with almost 50
million players produced. The reason for that success is the exact
and well-defined specification from the beginning, which became a
worldwide standard. In the video tape world we still have the fight
between the Beta and VHS standards, and in the videodisk world we
have the different incompatible products like LaserVision,
SelectaVision and VHD.

Every audio compact disk is compatible with every CD player in the
world.

On the digital side, for multimedia presentation the fight will be
between CD-ROM XA, CD-I, DVI and the erasables, and it is difficult
to predict a winner.

All of the products contain technologies, and will get further
technologies, that will make them of great interest in the future.


New compression algorithms will solve the full-screen film
presentation problem, but I think the hardware and software
producers will sit on the fence until standards have been
established. These industries have major possibilities, but cannot
really take off until standards have been established. Many of the
products are of great interest, but will not go commercial because
no one wants to risk large amounts of money in a technology that may
not succeed.

The industry is also suffering from a related problem of vision. The
CD-ROM and multimedia industry must know where it is going: in which
markets these new technologies would be used, what form they would
take, or even what they would be used for.

Until the uses, formats and economies become established, it will be
difficult for most libraries to just purchase media that each
require extensive, incompatible hardware platforms. Libraries are
often the environment where multimedia are used, and they act as
trend setters for multimedia workstations.

The CD-ROM multimedia industry will be a major area of growth and
competition in the nineties. Several 'standards wars', which are no
unfamiliar phenomenon in the computer world, have begun, and as
always in such fights it is really difficult to predict the result.

According to a report by Frost & Sullivan Inc. called the Optical
Memory Market in Western Europe, the optical storage market is
expected to soar by 1993. The market for disk drives and disk media
will increase from 392 million in 1988 to 3913 million a year by
1993, with West Germany buying the most, followed by Britain and
France.
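
To get a feeling for how steep this forecast is, the two market figures can be turned into an average yearly growth rate. The sketch assumes steady compound growth between 1988 and 1993, which is a simplification made here and not a claim of the report.

```python
def annual_growth(start, end, years):
    """Average compound growth rate per year between two market sizes."""
    return (end / start) ** (1 / years) - 1


# Market figures in millions, as quoted from the report.
rate = annual_growth(392, 3913, 1993 - 1988)
```

Under that assumption the market would have to grow by somewhat under 60% every year, that is, roughly tenfold over the five years.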

The small erasable disk drives are predicted to become the dominant
form of optical storage, accounting for 72% of 1993 optical disk
drive dollars.

In media, the small write-once disks will take 85% of the market.


RECOMMENDATION

For those of you who want to read more about the historical and
technological aspects of digital optical media, I recommend The
Brady Guide to CD-ROM.

For those of you who are interested in capacity figures and uses for
both analog and digital media, and who read Swedish, I recommend the
Via Teldok publication (see literature).


Computer-Assisted Publishing with Its Standards and Compatibility Problems


Hotter  Wendt,  Christine  Zitam 

Springer-Verlag, Tiergartenstr. 17, D-6900 Heidelberg


Introduction 

The original task of a publishing house, that is, to publish
information in printed form, has changed enormously during the past
few years. The possibilities offered by electronic data processing
have several effects: on the one hand, more and more authors are
choosing to write their "manuscripts" on word processing systems and
send them to the publishing house on data carriers or by electronic
mail. On the other hand, the possibility of using one text for
several purposes has increasingly led to information being published
on new media.


Computer-Assisted  Publishing 

The  term  "computer-assisted  publishing"  is  often  used  for 
publication  of  something  written  by  means  of  electronic 
media  (especially  computers),  i.e.,  for  the  application  of 
computer-assisted  methods  which  make  it  possible  to  find, 
write,  design,  and  store  information  and  to  distribute  this 
information  via  different  exchange  systems. 

In our opinion it is important to make a distinction between the
production of information by means of electronic media in
combination with optical and non-electronic techniques on the one
hand (e.g. keyboarding of the data, administration of the data in a
database, computer-aided typesetting and printout on an electronic
printer, which are mainly the task of the author) and the electronic
publishing of information on the other hand, which merely means that
a text is published and distributed with the aid of electronic
media. We therefore tend to use the term computer-assisted
publishing where computers are used for facilitating production and
processing of printed media, and electronic publishing where
computers and communication systems serve to make data available
through electronic channels.


Using  New  Media 

From a very early stage, Springer-Verlag was concerned with the
possibilities of offering data and producing non-print media. As
early as 1986, the first four demonstration CD-ROMs were presented
at the Frankfurt book fair, and the first marketable products were
produced in the following years (dangerous goods CD and mathematics
abstracts CD). According to our definition, this kind of publication
is part of electronic publishing and will therefore not be discussed
in more detail. In this connection, we just want to stress that for
good data retrieval on a CD-ROM (and of course on other databases as
well) documents must be available in structured form.


Electronic  Manuscripts 

A publisher should, of course, aim at directly using data keyboarded
by the author for further processing. Nevertheless, the problems
arising are sometimes so serious that data must be keyboarded a
second time.

The first problem is - although nowadays we can possibly already say
"was" - the different diskette formats, which are estimated at
approximately 1300. For some time now there have been conversion
systems on the market (e.g. GICO from GEFA) with which almost all
diskette formats can be converted, so that physical transformation
of data no longer seems to be a real problem.

The second problem, which is much more complicated to solve, is
still the variety of coding systems used by the authors. Of course,
many of them can be made readable by means of translation programs,
but this very often requires an expenditure equal to or even greater
than the money saved.

In the last few years, trends have become visible in the coding
systems favored by authors of Springer-Verlag: particularly in the
fields of physics, mathematics and computer science, an increasing
number of authors use TeX for writing and formatting manuscripts, a
program which is especially designed for difficult scientific texts
with many formulas. This has led us to develop so-called macro
packages to facilitate the author's work, e.g. with regard to
layout. The use of these macros has the advantage that the expenses
for correction are reduced, as additional typesetting can be
omitted, and last but not least books and journals can be published
faster. Authors in the field of medicine or biology increasingly
seem to trust in Microsoft WORD. Nevertheless, the number of
programs used remains tremendously high.


Document  Interchange 

Despite the large number of word processing systems, a publisher
should aim at making a smooth-running document exchange possible.
Otherwise, data cannot be used simultaneously on different media. A
precondition for this multiple use of data is that they are
structured logically and do not contain layout-specific details. TeX
or WORD, the programs mentioned above, are therefore a suitable
exchange format only to a very limited extent since both of them
contain mainly layout specifications.


SGML 

For about ten years now, there have been serious attempts at
standardization. SGML (Standard Generalized Markup Language) was
developed in 1985 as a neutral exchange format for documents, and in
1986 it was made into an international standard (ISO 8879).

SGML is a tool for defining structures for documents specific to a
particular application, i.e., it allows the structure of documents
to be defined solely in terms of their logical structure,
independent of their subsequent layout. SGML, however, does not yet
offer definitions of document types. These have still to be
developed by means of the SGML standard.

The first organization to take steps towards defining document types
was the AAP (Association of American Publishers) in the USA. In
1986, the AAP developed an application based on SGML which became an
ANSI standard. In Germany, too, efforts were made to develop SGML
applications for the German-speaking countries. In 1987, the
Bundesverband Druck (German Printers' Association), the Börsenverein
(Association of the German Book Trade) and several publishing
houses, among them Springer-Verlag, and companies in the printing
industry started to work on StrukTEXT. On the one hand, printing
houses were not equipped with the means of converting data, so this
SGML application was only used in individual cases, and on the other
hand, only relatively few documents were structured using SGML, so
the necessary conversion programmes were never developed. In short -
everybody waited for everybody else, and thus widespread and
successful use of SGML did not occur.

Ultimate success only became possible with CALS. CALS
(Computer-Aided Acquisition and Logistics Support) is an initiative
of the American Department of Defense (DoD) which aims at providing
a standardized coding, structuring, and exchange format for all of
the DoD's documents. SGML is being used in this project for
structuring the documents. When the DoD announced that as of
September 1988 all calls for tenders and tenders are to be processed
using this system, soft- and hardware producers reacted promptly;
about four weeks after the decision, the first products supporting
SGML were put on the market.


Advantages of Using SGML

If documents are structured in this neutral format, they can be
easily and rationally processed and reused. This means that
flexibility in placing typesetting and printing orders is increased.
If a particular printer or typesetter is not available, it is easy
to have a new edition produced by another company at short notice,
as the data are coded in a way which everybody can understand and
convert. A second and even more important point is the possibility
of multiple use of the data. To have an article appear in more than
one journal does not mean that more keyboarding will be necessary,
even though these journals may have different layouts and may be
produced by different printers, because with SGML only logical and
therefore layout-independent document structures are defined. To
offer this article in media other than paper or in a database is
then, of course, the next step to take. To proceed in this way saves
on costs and - in some cases even more important - on time. That the
documents have a good structure also leads to the possibility of
data retrieval by content-related criteria, and new products like
printing on demand, electronic profile services or hypertext
applications only become conceivable with this sort of structure.


How  to  Proceed 

Firstof  all,  one  or  several  documenttype  definitions  (DTDs) 
are  necessary.  In  this  definition,  the  structure  and  meaning 
of  the  elements  of  a  text  are  fixed.  When  SGML  was 
developed,  great  importance  was  attached  to  the  fact  that 
not  only  computers  but  also  humans  should  be  able  to 
interpret  documents  marked  up  with  SGML.  One  should 
pay  attention  to  this  intention  when  creating  a  DTD. 

After completion of this work it will be necessary to have
individual layout descriptions for the journal, book or for any
other medium which serves as the information carrier. This
layout description can - or rather must - be produced by
each individual supplier of information since at this point it
is no longer smooth exchange of data which is in question
but the layout. And layout remains individual.

The  document  type  definition  and  the  layout  description 
will  later  be  freely  available  to  authors  and  keyboarders.  In 
this  way,  it  will  be  possible  for  them  to  produce  their 
documents  in  conformity  with  SGML  while,  at  the  same 
time, the layout description will enable them to get a visual
impression, on their monitors, of how the document will
subsequently appear in the relevant journal.


SGML Activities of Springer-Verlag

If each user were to develop his own DTD, documents
produced using these DTDs should then theoretically also
be usable by third parties. However, in each case it would
be necessary to supply information about the content of the
individual elements of the structure and to change the data
conversion program. From the economic point of view this
is certainly not useful. If one considers that authors
often write their manuscripts without knowing in which
journal their article will be published or even with which
publisher, it becomes obvious that a DTD is necessary
which is not specific to any one publisher. For this reason,
at the end of 1989 we set up a working group aimed at
developing and testing such a DTD. This working group
consists of the publishers Springer, Elsevier, Kluwer and
Thieme, the typesetters Stflru and SRZ Berlin, the software
supplier and consultant MID and the data base host FTZ.

A first practical application of SGML is the new edition of
a technical handbook, the "Dubbel - Taschenbuch für den
Maschinenbau". With this book we proceed as follows:

As a first step, a document type definition was made at
Springer-Verlag on the basis of SGML. Then, a layout
description for the printed version was developed so that a
typesetter could keyboard the data. As soon as the typesetter
has finished his work Springer-Verlag will receive the
structured data. These data, which are then structured
independently of a later layout, can easily be used both for the
printed version and for other applications (e.g. on
CD-ROM).

A  very  important  aspect  is,  of  course,  the  cost.  Although  this 
project  is  not  yet  finished,  first  calculations  have  indicated 
that  expenses  for  typesetting  have  only  slightly  increased. 

Problems  Arising  with  the  Use  of  SGML 

Problems with SGML arise in the field of illustrations,
tables and mathematics. The standard's approach for
mathematics is far from being sufficient for scientific texts
with many formulas. Springer-Verlag has therefore decided
to use TeX, which, in the meantime, has almost become a
standard for mathematical formulas. In most cases tables
also cannot be written by means of SGML. For figures the
problem lies in the quality. There are standard exchange
formats like TIFF or CCITT Group 4 Facsimile, but for high
quality halftone figures none of these standards is
sufficient. A further item which is still unresolved is the layout
description. No functioning standards are available yet so
that at the moment, individual solutions are necessary.

Things  to  Do  Next 

The  most  important  thing  for  a  publisher  is  that  authors 
accept  the  use  of  SGML.  Most  of  the  presently  available 
word  processing  systems  do  not  support  SGML  so  that 
menu-driven  text  processing  on  the  basis  of  a  document 
type definition is most often not possible. Developments in
this field are, once again due to the impact of CALS,
foreseeable.

Further developments will happen in the field of automatically
converting data produced on systems like Microsoft
WORD or TeX to SGML structures, and Springer-Verlag
will certainly continue to contribute to these developments
as well.

Telecommunications is transportation of
information. Before discussing the
capacities of the emerging transportation
systems for this type of 'goods', let us look
at different services and establish a set of
references, both for volume of
information, which is expressed in bits or
Megabits, and for information transfer
rates or channel capacities, which we can
express in bits per second, bit/s.

We shall in this paper look at the
highways for information, the broadband
communications channels and systems, and
we shall take this to mean systems where
the transmission requires a considerably
higher bandwidth or bit rate than what
can  be  accomplished  via  the  ordinary 
analogue  telephone  system. 

Materials  for  broadband  communication 
include  data  files,  high  resolution  graphics 
(including  three-dimensional  and  animated 
graphics),  documents  or  images,  and 
moving  pictures. 


$$H = -\sum_{i=1}^{N} p_i \log p_i$$

Using the binary logarithm the
information per symbol is expressed in bit
(binary digit).

The  required  capacity  of  a  channel  for 
transmitting  R  symbols  per  second  is 
therefore  RxH  bit/s. 
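
As a small worked example of the two statements above, the following Python sketch computes the average information per symbol with the binary logarithm and the resulting channel capacity R x H. The function names and the two example alphabets are illustrative assumptions, not taken from the paper.

```python
import math


def entropy_bits(probabilities):
    """Average information per symbol, H = -sum(p_i * log2(p_i)), in bit."""
    if abs(sum(probabilities) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    # Symbols with zero probability contribute nothing to the sum.
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


def required_capacity(symbol_rate, probabilities):
    """Channel capacity in bit/s needed for R symbols per second: R x H."""
    return symbol_rate * entropy_bits(probabilities)


uniform = [0.25, 0.25, 0.25, 0.25]    # four equally likely symbols
skewed = [0.5, 0.25, 0.125, 0.125]    # same alphabet size, unequal odds

print(entropy_bits(uniform))            # 2.0
print(entropy_bits(skewed))             # 1.75
print(required_capacity(1000, skewed))  # 1750.0
```

The skewed source carries less information per symbol than the uniform one, so the same symbol rate needs a lower channel capacity.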

The  data  rates  associated  with  different 
types  of  signals  differ  enormously,  and  this 
is  important  to  keep  in  mind  when 
discussing  the  possibilities  and  limitations 
of  telecommunication  systems. 

CONTINUOUS  TRANSMISSION 

The  resources,  such  as  bandwidth,  transmit 
power,  antenna  area,  required  to  establish 
a  connection  are  proportional  to  the  data 
rate. 


Broadband  communications  will  extend  the 
prominent  role  the  image  now  plays  in 
newspapers,  books,  brochures,  slides  and 
movies,  in  the  private,  educational  and 
business sector [1].


INFORMATION  AS  AN  ENGINEERING 
CONCEPT 

It was Claude Shannon who gave us the
mathematical tool to treat information
quantitatively. The information content of
a symbol is related to the uncertainty of
the particular information symbol, i.e. to
its probability of occurrence. The average
information per symbol from a source
which emits symbol a_i from an alphabet of
N such symbols, when the probability of a_i
is p_i, is given by the expression which is
almost as fundamental as the famous
E = mc^2.


For  continuous  transmission  of  information 
the  two  main  types  of  signals  are  sound,  in 
particular  speech,  and  live  pictures. 

Table 1 shows data rates for CCITT
standardized speech coding methods,
reflecting different quality requirements [2].


CCITT      Analogue bandwidth   Data rate
Recomm.    (kHz)                (kbit/s)

G.721      3.1                   32
G.711      3.1                   64
J.42       7                    192
J.41       15                   384

Table 1. Data rates for speech.

Quality   Data rate (Mbit/s)   Description

A         92 - 200             High definition TV
B         30 - 145             Digital component-coding signal
C         20 - 40              Digitally coded NTSC, PAL, SECAM
D         0.384 - 1.92         Reduced spatial resolution and movement portrayal
E         0.064                Highly reduced spatial resolution and movement portrayal

Table 2. Data rates for live pictures.


Live  pictures  require  considerably  higher 
channel  capacities,  typically  by  a  factor  of 
1000. 

A  data  rate  of  64  kbit/s  is  sufficient, 
however,  for  limited  resolution  and  limited 
movement,  but  the  low  data  rate  requires 
very  advanced  signal  processing.  On  the 
top  end  200  Mbit/s  or  higher  is  required 
for  High  Definition  TV,  HDTV.  This  is 
shown  in  Table  2. 


DOCUMENT  TRANSMISSION 

We  shall  here  use  the  term  document  in  a 
wide  sense,  and  it  is  then  understood  to 
comprise: 

-  text  in  alphacode, 

-  64  kbit/s  voice  in  the  form  of 
annotations 

-  rasterscanned  graphics  in  high 
resolution. 

Text  in  alphacode  is  moderate  in  data 
volume.  This  paper,  except  the  figures,  is 
contained  in  a  data  file  of  less  than  1/3 
Mbit,  all  control  characters  included. 

Spoken messages require a higher data
volume. With standard encoding the
present  text  in  alphacode  requires  the  same 
storage  capacity  as  about  3  seconds  of 
speech. 


Pictures  turned  into  numbers,  however, 
require  vast  amounts  of  information,  as 
shown in Table 3. Screen images of 1280 x
1024 pixels and 24 bit of colour resolution
require 1.28 x 1.024 x 24 x 10^6 bit, which is
more than 30 Mbit per picture. With
updating at a rate of 25 to 30 pictures per
second, this will require a channel of about
800 Mbit/s!
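
The arithmetic behind these figures can be checked with a short Python sketch. The function name is an assumption made for illustration; the pixel counts, colour depth and picture rates are the ones quoted above.

```python
def image_bits(width, height, bits_per_pixel):
    """Uncompressed size of one raster image in bit."""
    return width * height * bits_per_pixel


frame = image_bits(1280, 1024, 24)
print(f"one picture: {frame / 1e6:.1f} Mbit")

# Channel needed when the screen is updated 25 or 30 times per second.
for pictures_per_second in (25, 30):
    rate = frame * pictures_per_second
    print(f"{pictures_per_second} pictures/s: {rate / 1e6:.0f} Mbit/s")
```

The result is roughly 790 to 940 Mbit/s without any compression, which is the order of magnitude of the 800 Mbit/s quoted in the text.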


Data file or image                        Data volume (Mbit)

A4 facsimile (black/white)                1 - 4
A4 facsimile (gray)                       9 - 16
A4 facsimile (colour)                     30 - 60
Colour television image                   4 - 6
High-definition colour television image   16 - 24
Newspaper page                            200 - 600
High-resolution computer graphics         20 - 100
Relatively large data file                several 100

Table 3. The information contents of
documents.

THE  IMPORTANCE  OF  HIGH
TRANSMISSION  CAPACITY 

For the transmission of a document with a
given data volume there is a simple
relationship between channel capacity and
transfer time, as shown in Fig. 1 [3].
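
The relationship is simply transfer time = data volume / channel capacity. A minimal Python sketch, assuming the upper data volumes of Table 3 and a few channel capacities quoted elsewhere in this paper as illustrative inputs (the function and variable names are hypothetical):

```python
def transfer_time_seconds(volume_mbit, capacity_bit_per_s):
    """Time to move a document of the given volume over an ideal channel."""
    return volume_mbit * 1e6 / capacity_bit_per_s


# Upper data volumes from Table 3, in Mbit.
documents = {
    "A4 facsimile (black/white)": 4,
    "A4 facsimile (colour)": 60,
    "Newspaper page": 600,
}

# Channel capacities in bit/s; values chosen for illustration only.
channels = {
    "4800 bit/s": 4_800,
    "64 kbit/s": 64_000,
    "2 Mbit/s": 2_000_000,
    "140 Mbit/s": 140_000_000,
}

for doc_name, volume in documents.items():
    for channel_name, capacity in channels.items():
        seconds = transfer_time_seconds(volume, capacity)
        print(f"{doc_name:28s} {channel_name:>11s} {seconds:12.1f} s")
```

Protocol overhead, retransmissions and compression are ignored, so the figures are lower bounds for uncompressed data.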


Fig. 1. Data transmission nomogram.

The present PSTN, the Public Switched
Telephone Network, is designed primarily
for voice, and has a very limited capacity
for data transfer. Without using
sophisticated signal processing methods
the data rate will be in the region 2400 to
4800 bit/s.

We now see a gradual transition to what is
called the IDN, the Integrated Digital
Network, where speech is transmitted at
the standard rate of 64 kbit/s. Access to
this data stream for a user-to-user
communication will then increase the
capacity by a factor of 10 to 20.

From Fig. 1 it is obvious that when we go
to documents with high data volumes,
it is essential to use channels with
considerably higher capacities than what
can be provided in the analogue telephone
network, or even in the IDN.

To establish usable highways for
information it is necessary to have

- physical channels with sufficient
capacity

- a suitable network structure that allows
links with desired qualities to be
established between arbitrary users.

We shall first look at the physical
channels.


TECHNOLOGICAL  DEVELOPMENT 

The  subscriber  lines  of  the  present  PSTN 
represent  a  large  investment,  and  the  cost 
of  each  line  is  carried  by  a  single  user.  It 
appears  that  subscriber  lines  of  the  twisted 
wire  type  can  also  be  used  for  digital 
transmission  with  data  rates  up  to  the 
hundred  kilobit  per  second  range  for 
distances  of  some  kilometers. 

Trunk  routes  are  required  to  carry 
considerably  higher  data  rates,  but  these 
are  economically  less  critical  since  they  are 
shared  among  several  users.  The 
traditional  transmission  systems  are  based 
on  the  use  of  coaxial  cables,  which  provide 
capacities  in  the  100  Mbit/s  range,  and 
microwave  radio  relay  systems,  which 
today  are  constructed  on  a  large  scale  with 
capacities  of  140  Mbit/s. 

Satellites  have  the  unique  feature  that  they 
cover  large  areas,  and  that  they  establish 
links  between  arbitrary  ground  stations 
within  that  area.  The  cost  of  the  earth 
station, which is diminishing as a result of
the  technology  development,  must  be 
carried  by  each  subscriber,  but  the  space 
segment  is  a  common  cost  for  the  whole 
coverage  area.  Data  rates  are  also  very 
flexible  up  to  the  several  Mbit/s  range. 

The  real  break-through  for  low  cost  data 
transmission  is  the  introduction  of  optical 
fibres,  which  have  an  enormous  capacity, 
in the Terabit (million megabit) per
second range. The capacity of lightwave
systems has been increasing rapidly,
nearly doubling every 18 months over
the last ten years. Optical fibres of today
operate at data rates of typically 45 to 560
Mbit/s, providing 600 to 80 000 voice
channels per fibre. The repeater distance is
in the range of 1 to 450 km.
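
To give a feeling for what "doubling every 18 months" means over a decade, here is a small Python sketch. The function name is an assumption; the doubling period and time span are the ones stated above.

```python
def growth_factor(years, doubling_months=18):
    """Capacity multiplier after the given time at a fixed doubling period."""
    return 2 ** (years * 12 / doubling_months)


print(f"after 3 years:  x{growth_factor(3):.0f}")
print(f"after 10 years: x{growth_factor(10):.0f}")
```

Ten years at this pace corresponds to roughly a hundredfold increase in capacity.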

At the same time the cost of the fibres
has been decreasing. As a rule of thumb,
cost per bit falls by a factor of three each
time the bit rate has increased by a factor
of four. The cost per meter of single-mode
fibre dropped from five dollars in 1982 to
twenty-five cents in 1988 and is predicted
to drop to four cents by 1993 [4].
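
The rule of thumb and the quoted fibre prices can be turned into numbers with a short Python sketch. The power-law reading of the rule (cost per bit proportional to bit rate raised to -log 3 / log 4) is an assumption made here to interpolate between factor-of-four steps, and the function names are hypothetical.

```python
import math


def relative_cost_per_bit(rate_ratio):
    """Cost-per-bit multiplier when the bit rate grows by rate_ratio.

    Assumes the rule of thumb (x4 in bit rate gives 1/3 of the cost per
    bit) holds as a smooth power law between the factor-of-four steps.
    """
    exponent = -math.log(3) / math.log(4)
    return rate_ratio ** exponent


def annual_price_change(start_price, end_price, years):
    """Average yearly relative change between two prices."""
    return (end_price / start_price) ** (1 / years) - 1


print(f"x4 bit rate:  cost per bit x{relative_cost_per_bit(4):.3f}")
print(f"x16 bit rate: cost per bit x{relative_cost_per_bit(16):.3f}")

# Single-mode fibre: 5 dollars per meter in 1982, 0.25 dollars in 1988.
change = annual_price_change(5.00, 0.25, 1988 - 1982)
print(f"average price change per year: {change:.0%}")
```

Under these assumptions the fibre price fell by close to 40 percent per year over the six years quoted.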

The  emergence  of  very  long  span  high-bit- 
rate  optical  systems  has  reduced  bit 
transport  cost  to  a  point  where  bandwidth 
efficiency  is  no  longer  an  important 
parameter.  This  will  have  as  a  consequence 
that time and location lose their
significance  for  many  activities  connected 
with  information  processing. 


TELECOMMUNICATION  NETWORK 
DEVELOPMENT,  THE  ISDN 

The  telecommunication  networks  of  today 
form  a  conglomerate,  because  they  were 
each  constructed  for  a  particular  service. 
They are partly interlinked, and they now
make use of many of the same physical
resources, but they form separate
networks.

One obvious consequence of
information theory is that a single net is
sufficient for all types of services. That
has led to the idea of the ISDN, the
Integrated  Services  Digital  Network,  as 
illustrated  in  Fig.  2. 

ISDN  represents  a  standardization  of 
services  in  the  user  interface.  Different 
types  of  equipment  can  be  connected  to  the 
user  interface.  At  the  same  time,  within 
the  network  there  will  be  a  variety  of 
switching  and  transmission  functions.  The 
ISDN  is  developed  from  the  PSTN  through 
the IDN. The transfer to the ISDN will
then take place through the introduction
of digital subscriber lines and more
advanced signalling, which will broaden the
types of services.

There are two types of user interfaces
defined for the ISDN [5]:

- 2 voice channels at 64 kbit/s each and 1
data channel at 16 kbit/s, the (2B+D) or
basic rate access at 144 kbit/s for
individual users. The data channel is
mainly intended for signalling. Up to eight
terminals can be connected to a passive
bus.

- 30 voice channels plus 1 data channel (30B+D)
at about 2 Mbit/s for the business user, the
primary rate access. This is mainly
intended for point-to-point links between
PABXs (Private Automatic Branch
Exchanges).

In addition to the 64 kbit/s and the 16
kbit/s channels, two wideband
channels have also been defined: H0 at 384
kbit/s and H12 at 1920 kbit/s.
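
The channel arithmetic of the two interfaces can be verified with a few lines of Python. The basic rate figures are those given above. For the primary rate, the sketch assumes a 64 kbit/s D channel plus one 64 kbit/s framing slot, as in the European 2048 kbit/s interface; the text itself only states "about 2 Mbit/s", so these two values are assumptions.

```python
B_CHANNEL = 64          # kbit/s, one voice or data channel
D_CHANNEL_BASIC = 16    # kbit/s, signalling channel of the basic access

# Basic rate access, 2B+D.
basic_rate = 2 * B_CHANNEL + D_CHANNEL_BASIC
print("basic rate access:", basic_rate, "kbit/s")

# Primary rate access, 30B+D (assumed: 64 kbit/s D channel and
# one 64 kbit/s framing slot, which gives the 2048 kbit/s line rate).
D_CHANNEL_PRIMARY = 64
FRAMING = 64
primary_payload = 30 * B_CHANNEL + D_CHANNEL_PRIMARY
primary_line_rate = primary_payload + FRAMING
print("primary rate access:", primary_line_rate, "kbit/s")

# The wideband channels are whole multiples of the 64 kbit/s channel.
H0, H12 = 384, 1920
print("H0  =", H0 // B_CHANNEL, "x 64 kbit/s")
print("H12 =", H12 // B_CHANNEL, "x 64 kbit/s")
```

H12 thus occupies exactly the capacity of the 30 B channels of a primary rate access, and H0 one fifth of it.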

The  64  kbit/s  channels  can  be  used  for  3.1 
kHz  voice,  as  in  the  PSTN,  but  they  can 
also  be  used  for  other  services  such  as: 

-  data  transmission 

-  multiplexing  of  several  low  speed  data 
channels 

-  access  to  the  data  networks,  both  to 
circuit  switched  and  the  packet  switched 
services. 

-  high  speed  (Group  4)  telefax  which 
provides  higher  quality  and  lower 
transmission  time. 

The ISDN will also ensure
interconnectivity with other networks, e.g.
the PSTN, to the highest possible degree,
and the ISDN networks of other countries.
It is anticipated that "compatible" ISDN
services in most industrialized countries,
mainly in Europe, Japan and North
America, will be available from 1992.


Fig.  2.  The  Integrated  Services  Digital  Network. 


The ISDN system as a concept has some
specific and important properties that are
worth emphasizing:

- The development of traditional
communication networks is a very slow
process. Therefore, the development and
introduction of new services has also been
a slow process. ISDN may change this
situation since ISDN allows network and
services with associated subscriber
equipment to be developed independently
of the implementation of the network.

- ISDN will allow new services to be
introduced at marginal cost, and services
not demanded to be discontinued at
marginal losses.

On the negative side, the development of
ISDN is partly in conflict with network
optimization. You cannot develop a
universal machine that is optimal for every
particular application.


ISDN  makes  interconnection  simpler,  but 
the  subscriber  equipment  could  be  more 
complex  and  costly. 

It  is  therefore  possible  that  viable  new 
services  developed  within  the  ISDN  may  be 
transferred  to  specialized  networks 
optimized  for  that  particular  service. 


DEVELOPMENT  OF  WIDEBAND 
NETWORKS 

It  is  important  to  note  that  the  ISDN  as 
described  above  can  handle  a  wide  variety 
of  services,  including  live  pictures,  within 
64  kbit/s.  Nevertheless,  there  is  a 
development  trend  towards  higher 
transmission  rates,  towards  the  network 
that  is  usually  referred  to  as  the  BISDN, 
the  Broadband  ISDN,  to  meet  the  emerging 
requirements  for  high  speed  document 
transfer. 


For development of a general network the
most appropriate technology is the optical
fibre, and for development of specialized
networks it is the satellite.

Some of the work towards the definition of
the BISDN has been done within the
framework of SONET, the Synchronous
Optical Network, which defines standard
optical signals, a synchronous frame
structure for the multiplexed digital
traffic, and operations procedures [6].

The standard was initiated in the US, and
was extended to become an international
standard through the CCITT (Comité
Consultatif International Télégraphique et
Téléphonique), where the first recommendation was
adopted in June 1988.


The  levels  of  the  SONET  Signal  Hierarchy 
are  given  in  Table  4. 


Level     Line rate (Mbit/s)

OC-1        51.84
OC-3       155.52
OC-9       466.56
OC-12      622.08
OC-18      933.12
OC-24     1244.16
OC-36     1866.24
OC-48     2488.32

Table 4. SONET hierarchy and data rates.
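
Every level in Table 4 is an integer multiple of the OC-1 rate, which a short Python sketch makes explicit. The function name is a hypothetical helper; the rates are those of the table.

```python
OC1_MBIT_PER_S = 51.84


def oc_rate(level):
    """Line rate of SONET level OC-n in Mbit/s: n times the OC-1 rate."""
    return round(level * OC1_MBIT_PER_S, 2)


for level in (1, 3, 9, 12, 18, 24, 36, 48):
    print(f"OC-{level:<3d} {oc_rate(level):8.2f} Mbit/s")
```

The proposed BISDN interfaces of about 150 and 600 Mbit/s correspond to the OC-3 and OC-12 levels.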


Where the BISDN is concerned, the
capabilities must be flexible. Interfaces of
about 150 to 600 Mbit/s have been
proposed. The usable capacity of an
interface can be organized by Synchronous
Transfer Mode (STM) or by Asynchronous
Transfer Mode (ATM) techniques.

STM extends narrow band ISDN concepts,
as shown in Fig. 3(a) [7].

With the ATM technique all services could
be carried over a single, integrated,
high-speed packet-switching fabric. ATM
interfaces are extremely flexible, using
signalling to dynamically reconfigure a
mix of logical channels, as illustrated in
Fig. 3(b).


USER  EQUIPMENT 

The user equipment is also undergoing a
development which means that the new
capabilities of the wideband networks can
be utilized.

Today,  a  typical  PC  operates  at  2  to  8 
MIPS  (million  instructions  per  second), 
contains  a  few  megabytes  of  memory  and 
offers  "effective"  back-plane  bus  and  I/O 
rates  of  a  few  megabytes  per  second. 

Emerging  PCs  and  workstations  will 
operate  at  20  to  100  MIPS,  have  extensive 
memory  capacity,  and  I/O  rates  near  100 
megabytes  per  second. 

High  end  machines  operate  at  500 
MFLOPS  (million  floating  point  operations 
per  second)  to  2  GFLOPS,  have  up  to  2GB 
of  memory,  and  channels  in  the  100  Mbit/s 
range. 

Machines  under  development  will  operate 
at  10  to  40  GFLOPS,  have  upwards  of 
8GBit  of  memory,  and  channel  interfaces 
upwards of 800 Mbit/s [8].

SPECIALIZED  NETWORKS. 

Satellites  may  be  used  as  a  part  of  the 
general  telecommunication  network,  but 
they may also be used to establish
overlay networks, user to user,
independent of the general
telecommunications systems. Systems of this
type are called user-oriented satellite systems.

For  a  satellite  system,  the  high  capacity 
pipelines  can  be  redirected  instantaneously 
over  the  whole  coverage  area  through  the 
signalling  and  access  control  system. 

The American company AT&T initiated in
May 1986 a so-called VSAT (Very Small
Aperture  Terminal)  system  providing 
business  customers  with  digital  data  and 
video  communications. 

Fig. 3. BISDN Synchronous Transfer Mode (STM) and Asynchronous Transfer Mode
(ATM).


One-way transmission from the hub
station is offered at data rates of 9.6 kbit/s,
56 kbit/s and 1.5 Mbit/s. Two-way
transmission is offered at 256 kbit/s
outbound from the hub and 56 kbit/s
inbound.

One version of a VSAT system, which has
been developed by ESA (the European
Space Agency), is the APOLLO system,
shown in Fig. 4. The system is a very good
illustration of the basic ideas of the VSAT concept.

One characteristic feature of the APOLLO
system is communication links with
different capacity in the different
directions. The forward direction is a
time-shared 2 Mbit/s link from the information
sources via a large earth station to small
and cheap receiving stations at the users'
premises. The return link could be provided via
terrestrial networks or via a low data rate
satellite link.

Another development of the VSAT concept
is the SWAN (Satellite Wide Area
Network), as illustrated in Fig. 5 [9], which
is used to interconnect LANs (Local Area
Networks) and MANs (Metropolitan Area
Networks).

Fig. 4. The APOLLO system.


This  is  a  network  with  a  very  flexible 
architecture,  and  can  be  used  in  different 
forms  of  point-to-point,  star  and  mesh 
configurations. 

One  particular  category  of  applications  for 
this  type  of  system  is  wide  area 
communication  services  and  remote  area 
information  services  of  different  types 
such  as  corporate  data  base  access  and 
public/private  information  database  access. 


POINT  TO  MULTIPOINT  SYSTEMS 

Another  category  of  specialized  networks 
can  be  established  via  broadcasting 
satellites,  which  represent  both  enormous 
investments  and  very  valuable  commercial 
possibilities. 

The  cost  of  the  small  stations  is  largely  a 
question of production volume. We have
recently  seen  satellite  TV  receivers  sold  in 
shops  with  a  price  tag  of  £199,  everything 
included. 


A new signal format, MAC (Multiplexed
Analogue Components), allocates about
10  %  of  the  total  TV  channel  capacity  for 
sound  and/or  data  on  an  interchangeable 
basis.  And  10  %  of  a  TV  channel  is  a 
considerable  capacity  when  used  for  voice 
and  data! 

The  D-MAC  system,  which  is  now  gaining 
the  most  widespread  acceptance,  can 
provide  about  2  Mbit/s  of  data  in  addition 
to  ordinary  TV  program  sound.  (If  the  TV 
system  is  used  for  several  simultaneous 
multi-lingual  commentary  channels,  the 
capacity  is  reduced  accordingly.)  The  data 
can  be  directed  to  defined  user  groups  by 
the  access  control  system,  which  initially 
was devised to document the numbers of
viewers of a particular TV program, and to
ensure that the viewers pay for the service.
The organization of the access control is
shown in Fig. 6 [10].

The MAC system with access control, and
with encryption as required, is a powerful
infrastructure for distributing documents
and other types of data. The system can
also be utilized for closed TV transmission,
e.g. to branch offices of geographically
widespread companies.

Fig. 5. SWAN architecture.

Planning of sessions, ordering of
documents, etc. can be arranged via the
ordinary terrestrial network. The biggest
advantage of this system is low equipment
cost. The configuration could be a
commercial TV satellite receiver, an
interface card and a commercial PC.


CONCLUDING  REMARKS 

This  paper  has  aimed  at  giving  an  overview 
of  the  development  of  information 
transportation  systems  with  capacities 
beyond  what  is  available  today  via  the 
public  switched  telephone  network. 

Two elements are directing the course of
development: technology and business, and
the latter is the more difficult one. We are
today able to provide technical solutions to
almost any problem. The real challenge is
to develop services that are of value to the
user, and that the user is willing to pay
for.

Service  development  is  a  difficult  process, 
since  the  value  of  a  service  in  many  cases 
depends on the penetration of its use. We
also have the situation that new services
require  new  equipment,  which  will  be 
developed  and  mass  produced  only  if  there 
is  a  service  creating  sufficient  demand. 

This  circle  is  not  easy  to  break. 

The  ISDN,  which  allows  a  new  service  to 
be  introduced  at  marginal  cost,  is  an 
important  tool  for  testing  out  the  value  of 
further  service  to  the  users. 

1. Introduction


Natural Language Processing (NLP) is almost
as old as computers. The first applications
of computers for manipulating linguistic
data were those that were the easiest to
make: statistics, sorting (establishing
word lists from text), concordances etc. In
brief, the results were interesting for
researchers as they provided new data on
language and linguistics, and much faster
than humans could do. Some results were
however readily usable also by people
outside the researchers' world: all those
having to do with coding and decoding
messages. As you all know, computers were used
for this purpose already during World War
II. I will not go further into this field
of application.


Since the late 40s and the early 50s
many fields of application have developed,
supported by progress in computational
linguistic research and in production of
computers, and lately also by progress in the
field of Artificial Intelligence (AI). A
real breakthrough of the industrial
application of NLP has however not yet taken
place. Foreseeing this breakthrough some
people have created a new term, Language
Industries, or in French, where it was first
invented, "l'industrie des langues".


We can give no exact date for when this
industry will become profitable, but it
seems obvious that it will happen in the
90s.


In this paper I have chosen to treat a
number of fields of application, each of them
by at least one case example. The paper
does not claim to cover all fields of
application, nor to cover any application
exhaustively, for obvious reasons.


Terminological  remark: 

The scientific community does not have one
single opinion on the relationship between
NLP and AI. NLP people seem to believe that
there is a field called NLP which exists
independent of AI, and which can take over
methods and results from AI. Most AI people
however seem to think that AI is a wide
field covering e.g. NLP. The wording of the
title of this paper shows that the latter
view is not followed here. Still this does
not solve the problem of defining AI. In
the present paper a rather relaxed
interpretation of the field is assumed, e.g.
that one deals with conceptual structures
or objects, that some kind of implication
or inferencing is used, or similar. In
order to get examples from a variety of NLP
fields even this relaxed view is not always
strictly adhered to.


2. Machine Translation (MT)


A major application field for NLP is the
field of translation. It is also one of the
very first fields of application: the
Georgetown system which translated Russian -
English was presented in 1954, and a variety
of systems are being marketed at
present. Translation is however a very complex
task, and at present no general-purpose MT
system exists which produces good quality
translation. This will continue to be true
for some time. The reason that MT is anyway
one of the major applications is that there
is such a need for translation that a
less than ideal output can still be useful.
In order to give a state-of-the-art picture
of the field of MT, I have chosen to
describe two MT projects, a European and an
American one, and to mention a few more.


2.1.  The  task  of  translating  by  computer 


In the early days of machine translation
the translation task was actually seen as a
variant of the decoding tasks mentioned
above. Secondly, there was a feeling that
translation could more or less be reduced
to consulting a bilingual dictionary, and
making a few (local) word order
transformations.


The difficulties observed with these
systems led to the development of second
generation MT projects where the translation
process was broken down into either two or
three steps:

The difference between these two approaches
is that the former assumes an interlingua,
i.e. a (formal) language which has the
expressive power of any natural language (or
at least those involved), whereas the
latter is also modular, but maintains a
translation phase proper, the transfer phase.
There are no examples of general
translation systems of significant size using the
interlingua approach, one reason being that
a useful interlingua has not been
definable, but recently a side version of this
has emerged: the DLT in which the
artificial (but not formal) language Esperanto is
used as interlingua. The original idea of
an interlingua - if at all feasible - would
undoubtedly make the use of AI necessary.

The  transfer  approach  has  been  used  widely, 
with  emphasis  on  various  aspects. 

2.2.  EUROTRA 


The Council of the European Communities decided in 1982 to
start a European (EEC) project in MT, covering all the EEC
official working languages, financed jointly by the Commission
of the European Communities and the member states. The project
as such is ending at the end of this year, but follow-up
actions are being prepared.

The novel aspect of EUROTRA when it was first conceived was
that it is inherently multilingual, whereas other systems which
can treat several language pairs (e.g. SYSTRAN and METAL)
normally develop each language pair more or less separately.

EUROTRA is transfer based. The result of the analysis phase is
an internal representation of the source text. In the transfer
phase, the internal representation of the source text is
translated into an equivalent internal representation of the
target text. This translation step basically consists in
exchanging lexical units, but in some cases a change of
grammatical relations is also necessary. Finally, the synthesis
phase produces a surface target language text from the internal
representation.

The advantage of this modular structure of the overall
translation process is that the same analysis module can be
used for all eight target languages. Similarly, the same
synthesis module is used for all source languages, leaving only
the transfer module to be bilingual.
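
The arithmetic behind this advantage can be sketched as follows.
The count assumes nine languages (which is what eight target
languages per source language implies) and that every ordered
language pair is covered; the function names are illustrative
and not part of EUROTRA.

```python
def transfer_architecture_modules(n_languages: int) -> dict:
    """Count modules when analysis and synthesis are shared per language."""
    return {
        "analysis": n_languages,    # one monolingual analyser per language
        "synthesis": n_languages,   # one monolingual generator per language
        "transfer": n_languages * (n_languages - 1),  # one per ordered pair
    }


def interlingua_architecture_modules(n_languages: int) -> dict:
    """With a true interlingua no bilingual transfer modules are needed."""
    return {"analysis": n_languages, "synthesis": n_languages, "transfer": 0}


if __name__ == "__main__":
    # Assumed figure: nine languages, i.e. eight targets per source language.
    print(transfer_architecture_modules(9))
    print(interlingua_architecture_modules(9))
```

Under these assumptions the transfer modules are by far the
most numerous component, which is why their size matters.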

In a multilingual transfer system there is a clear interest in
making the transfer modules as small as possible, i.e. in
pushing as much of the work as possible into the monolingual
modules. This means that the research will aim at producing
internal representations for exchange which are as interlingual
as possible, thereby eliminating the need for explicit transfer
rules which change an object (a lexical item, a feature or a
structure). The greater part of the linguistic research done
within EUROTRA is aimed at improving the interlinguality of the
internal representations.

All the pieces of knowledge which are used are essentially of a
linguistic nature. In the follow-up programme it will be
investigated how extra-linguistic knowledge can also be used,
e.g. for disambiguation purposes.

2.3. KBMT-89

At Carnegie Mellon University the KBMT-89 project ran from 1988
to 1989. This machine translation project translates between
English and Japanese (in both directions). It has two major
characteristics: it is knowledge based and it uses the
interlingua approach to MT.

The interlingua is a structure describing the semantic
relations between constituents of the text. The meanings of the
words of the text are given by an ontology or domain model
which contains 1500 concepts.

Apart from the traditional syntactic parser and
syntax-to-semantics mapping, and apart from the (static)
knowledge expressed in the ontology, KBMT contains an episodic
or dynamic memory which establishes e.g. instantiation links
between objects or events.

Following the traditional interlingua idea, source and target
language expressions are never compared in KBMT, the concepts
and semantic relations being the common part.

As described above, the interlingua approach is quite
ambitious, and it will be extremely interesting to see
extensions of the KBMT approach to a larger coverage, so it is
a project to be followed. It should be mentioned in conclusion
that KBMT-89 is a research project, not aiming at a practical
MT system. Future projects may also include application.

2.4. Interlingua or transfer?

This discussion is still ongoing among MT researchers. It seems
to me that the clearest advantage of the interlingual approach
is that it makes the tedious writing of transfer rules
superfluous. The clearest disadvantage is that it is at present
not possible to define an interlingua for more than very
restricted sublanguages. The KBMT-89 researchers agree with
these two statements, but claim further that transfer MT
requires post-editing while interlingual MT does not. In my
opinion all MT, if treating more than extremely restricted
sublanguages, will require post-editing, and this for some time
to come.

While it seems pretty obvious that it is a goal of transfer
systems to make the transfer component as small as possible, it
is still not clear that the optimal strategy is to make the
transfer component disappear totally. Also, the fact that
knowledge bases and inference techniques are used does not
automatically entail the interlingual approach. Still, a good
deal of research into the theory of machine translation and
into the way the various approaches relate to such a theory is
needed. In fact this subject is attracting interest from
researchers in the field.

3. Computer Assisted Instruction

Another field of application of NLP is Computer Assisted
Instruction (CAI). Computers have been seen as possible tools
for instruction for more than 20 years. The early CAI programs
would teach students e.g. Fortran by presenting statements and
questions to the student. The answers would be checked against
the stored answers and various actions taken: the student would
get feedback, and the teaching program would continue with the
next question or go back and repeat some earlier information
and the questions belonging to it. That is, each program would
itself contain the teaching strategy, expressed in the control
structure of the programming language, and the data (questions
and answers) in some language.

The next step was the creation of so-called author languages
for writing teaching programs (e.g. COURSEWRITER). This would
free the author from expressing the control of the teaching
event in a programming language like Pascal or Fortran, which
was a big step forward. On the other hand, what has almost
always hampered the development and use of these CAI systems
has been the poor treatment of natural language, and author
languages have normally not offered much support in this area.

The problem is that all text to appear either as instruction,
as question or as answer has been stored in a fixed format. For
the computer part, the instructions and the questions, this is
not so much of a problem. The problem arises with the treatment
of the human part, the answers. If there is no linguistic
treatment of answers at all, the program has to be written in
such a way that only very concrete and short answers are
needed, e.g. "What is the capital of Norway?" "Oslo", or "How
many countries are there in the EEC?" "12". This puts a limit
on what subjects can be treated by CAI.

Even in this very restricted scenario, however, authors of
programs realise that some relaxation or intelligence is
needed: if the program is not a spelling exercise, some
deviation from normal spelling should be accepted, such as both
capital and small letters, typographical errors, and the choice
between "12" and "twelve" etc. There have been some extensions
to this type of answer handling, but real full-scale NLP using
parsers and grammars for natural language is still not used.
The reason for this is undoubtedly that the requirements of
robustness and correctness of the grammars involved can hardly
be met. Since it is a claim of the Language Industry that such
grammars with a reduced scope and function can now be made,
this may change in the near future. On the other hand, such
intelligent answer handlers will only be of real use when they
not only work correctly, but are also efficient in terms of
computer time and space.
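
A minimal sketch of such relaxed answer checking is given
below. It assumes a small hypothetical table of number words and
a tolerance of one typing error; none of the names come from an
actual CAI system or author language.

```python
# Hypothetical table mapping spelled-out numbers to digits (illustrative only).
NUMBER_WORDS = {"ten": "10", "eleven": "11", "twelve": "12"}


def normalise(answer: str) -> str:
    """Lower-case, collapse whitespace and map number words to digits."""
    text = " ".join(answer.lower().split())
    return NUMBER_WORDS.get(text, text)


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def accept(student_answer: str, stored_answer: str, max_typos: int = 1) -> bool:
    """Compare a typed answer with the stored one, tolerating small slips."""
    given = normalise(student_answer)
    expected = normalise(stored_answer)
    if expected.isdigit():
        # A one-character slip in a number is a different number: no tolerance.
        return given == expected
    return edit_distance(given, expected) <= max_typos
```

With these assumptions `accept("oslo", "Oslo")` and
`accept("twelve", "12")` succeed, while `accept("13", "12")`
fails. Nothing here involves a grammar or parser, which is the
limitation the text describes.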

When we see a growing interest in CAI these years, it is not
because of the integration of large-scale NLP components.
Rather, the revival of this application stems from the fact
that 1) in many countries or areas of countries it is the best
way of spreading knowledge, 2) computers are becoming cheaper,
and 3) not all subject fields need linguistic expression
equally much, and by ingenious use of various techniques good
interfaces can be obtained. This leaves a challenge for NLP
people.
3.1. Expert systems

Some research has been carried out to investigate the use of
expert systems or expert system technology for CAI purposes.
This is probably the most obvious way of including AI
techniques.

An expert system contains knowledge about a given domain, and
inferencing rules which express the various types of
implications between pieces of knowledge. An expert system for
teaching needs to contain all the knowledge about the domain,
but also knowledge about how to correct errors. An experimental
teaching system, CAPRA, for teaching introductory programming
concepts to novice programmers has been developed in Spain
(Garijo et al. 1989). The system builds a program to solve a
specific problem, following a programming methodology. It is
able to explain the reasoning process step by step, showing the
knowledge used at every decision point. This is the expert
system part of the CAI program. This part is combined with a
module that supervises the interactive process between the
student and the expert system.

The supervision module uses a natural language interface for
communication with the user; it explains problems and proposes
exercises, depending on the student's performance.

The research prototype works, but there is still some way to go
before this methodology will be widely applied. One should also
note that introductory programming concepts are among the
easier fields to describe exhaustively in a knowledge base, and
that it will definitely take some time before complex domains
can be described adequately, just as for expert systems in
general.

Research projects such as the one described are interesting and
seem promising for the integration of expert systems into CAI.

An interesting point put forward by expert system researchers
is that there is actually a difference between expert systems
and the domain components of intelligent CAI systems, the
difference being that the reactions and implications to be made
by the expert system depend on whether the user is an expert or
a student. This means that the factual knowledge can be common
to the two kinds of use, whereas the procedural or strategic
knowledge will differ according to the use (Kamsteeg and
Bierman, 1989). Consequently, this puts a constraint on how the
knowledge is structured in the expert system if one wants to
use it also for teaching purposes.

3.3.  A  challenge 

One very appealing and yet unexplored idea is to view an NLP
grammar and dictionary of a language as a knowledge base, and
to create an intelligent CAI system based on this knowledge. It
is obvious that a considerable amount of knowledge engineering
is needed in order to create the knowledge base in an expert
system format. The claim is that some of the formalisation
effort has already been made in transforming the grammar and
dictionary information from the form of grammar books and
ordinary dictionaries into formal grammars and dictionaries for
NLP. How important the next step, i.e. the transformation of
such computer-oriented knowledge into an expert system
knowledge base, is remains to be seen.

4. Application of neural networks

An approach to modeling intelligence which is now an
alternative to the traditional AI approach is the use of neural
networks (Rumelhart and McClelland 1986). The idea is basically
to model neural networks in a computer by defining states and
transitions between states in a multi-level organisation. Such
a network is then trained by giving it input and the
corresponding correct behaviour. E.g. if the task is
hyphenation of words, the network will be given a certain
number of words with indication of all possible hyphenation
points. When given a new word, the network will hyphenate with
a certain, and normally rather good, probability of
correctness.

The hyphenation application has been developed e.g. for Danish,
and it works reasonably well. On the other hand, hyphenation
programs made according to more traditional principles may
perform as well, so as a research result this application is
interesting, but seen only from the application point of view
it is of less interest.

It may be argued that hyphenation is a rather simple task; it
is mentioned here because it constitutes a real application of
neural networks to an NLP problem. Below I will describe the
application of neural networks to a subtask of NLP, namely
syntactic category disambiguation.

One of the major problems in NLP is ambiguity and the
overgeneration this leads to. Ambiguity arises at all levels of
processing, e.g. for each word of a written text there may be
an ambiguity of word class (Eng. can is either a noun or a
verb), within a word class there may be several readings or
meanings (can as modal verb or full verb), etc. Such
ambiguities may lead to several possible structures and
semantic interpretations. It is one of the difficulties of NLP
to disambiguate or choose between parallel structures. For
efficiency reasons disambiguation should take place as early in
the processing as possible. Benello et al. (1989) used a neural
network to disambiguate syntactic categories.

Since people use the context to disambiguate words, the network
was also given a context: 4 unambiguous words preceding the
unknown (ambiguous) word, and one following. After training it
determined syntactic category with 95% accuracy. This is
comparable to the results of well-known linguistic parsers,
but again, as for hyphenation, not better than other known
methods.

The interesting aspect of this type of research is consequently
not really the application result already obtained. The
interesting aspect is the new way of modeling intelligence. It
is particularly interesting because there are reasons to
believe that the human brain works much the same way: first of
all it is clear that human brains have strong constraints on
the amount and type of computation they can do, and these
constraints may weaken traditional linguistic theories' claim
of psychological reality. No human has the capability of
capturing at the same time all 90 possible parses of "Still
water runs deep" which a traditional computer program has to
examine before choosing the correct one.

Apart from this issue of psychological reality, it may also be
that neural networks find larger use in NLP applications. The
future will show what happens when neural networks are built
which have to cope with more complex problems.

5. Summary

Two very important areas of NLP application have been described
in some detail: machine translation and computer assisted
instruction. Both fields are extremely important, and of
growing importance. Artificial Intelligence techniques are only
starting to be used in applications. At the same time an
alternative model for artificial intelligence has emerged:
neural networks.

Neural networks are interesting from a theoretical point of
view, because they can be said to take into account the biology
of human information processing. It is not possible at present
to evaluate the potential of neural networks for application in
the Language Industries.
Abstract

Conventional full text retrieval systems often omit the
pictures from the material they display. We are taking the
existing machine-readable text of the American Chemical Society
journals, scanning the pages from microfilm and extracting the
images from the text by algorithms which analyze the digitized
bitmaps. The combined pictorial and text material will then be
used with full-text search to provide access to the complete
file. The major experiments to be done are to: (a) measure
acceptability of the electronic systems; (b) compare full text
search with search of titles, abstracts and/or indexing; and
(c) compare presentation of full page images in bitmap format,
presentation of text in ASCII with graphics on demand from
images, and traditional paper copies of the journals. The major
parts of this research are:

(1)  Software  to  partition  digitized  images  of  pages 
into  textual,  tabular  and  pictorial  areas.  This  is 
used  to  extract  the  graphics  material,  which  is 
then  matched  with  the  commands  referencing 
the  pictures  in  the  typesetting  tapes,  and 
prepared  for  display  as  bitmaps. 

(2)  Search  software  which  implements  conventional 
searching  capability  (Boolean  and  coordinate 
index term search) on the full text of the journals,
which is available from the typesetting and
on-line  operations  of  the  American  Chemical 
Society.  Experiments  are  also  continuing  with 
the  use  of  singular  value  decomposition  to  group 
documents  and  concepts  to  aid  searching. 

(3)  Browsing  and  reading  software  to  help  people 
read  complex  journal  material  on-line,  by 
highlighting  matches,  formatting  paragraphs, 
and  providing  interactive  screen  displays. 

Introduction. 

Despite many predictions over the last thirty years that
electronic information systems would entirely replace paper
(see for example Samuel 1964), even in advanced societies
scientific publications are still mostly read in traditional
form. This is true despite the availability of most textual
material in a machine-readable format, thanks to the general
adoption of word processing systems. We suspect that part of
the problem is the lack of graphical material in many
conventional full-text retrieval systems, many users of which
see words without illustrations as a very inferior substitute
for traditional journal publication. Our experiments provide
various formats of full text with graphics to see what, if any,
kind of electronic presentation will attract users away from
conventional publications.

This effort is a joint project between the American Chemical
Society, Chemical Abstracts Service, OCLC, Bellcore and Cornell
University. Our test area is chemistry, thanks to (a) the
availability of the American Chemical Society backfile of
full-text chemical journals online, (b) the availability of the
Chemical Abstracts Service file of indexing and bibliographic
information, and (c) the quality of, and interest in
information retrieval shown by, the Cornell chemistry
department as users.

In this paper we will discuss some of the questions raised by
the preliminary work so far and the various systems being
proposed to answer them. The major issue is how to efficiently
integrate graphics with full text. People skim journal articles
very rapidly, and with pictures we cannot compensate for slow
display by searching rapidly, since it is difficult to search
pictures.

2.  Information  content. 

Our collection is based on the text files of the American
Chemical Society. These are derived from their computer
typesetting production facility and contain a very detailed
markup of the individual articles (for example, the sentence
boundaries are marked). There are approximately 500,000 pages
of material available. Our current 1,000 article database is
about a 1% extract (in our counting, an "article" may include
something as small as a one-paragraph book review). It
represents the first twelve issues of the Journal of the
American Chemical Society for 1988. Approximately 20% of the
pages, measuring in square inches, are figures. There are about
4200 pages in the file, and the total text is about 30 Mbytes.

We have three sources of data that we are using to construct
the file. These are (a) the ACS text files, (b) the Chemical
Abstracts Service files of indexing, abstracting, and
bibliographic data, and (c) the microfilm versions of the
original journals. The intent is to get the text from the ACS
and CAS files, and the pictures from the microfilm. The reason
we are scanning microfilm rather than paper is that microfilm,
being a physically robust material, can be scanned faster in
bulk. A Mekel M400 microfilm scanner is used to produce images
at either 200 or 300 dpi, and the pages are then processed in
several steps.

The first step in processing is to sort printed from
supplementary pages. The Journal of the American Chemical
Society provides a facility for authors to supplement their
articles with pages of additional data which are not published
in the journal as mailed, but which are on the microfilm. We
have to identify and eliminate these pages in order to provide
a file which matches the printed journal. Three basic
techniques are used to distinguish printed text pages from
other pages. The supplementary pages are most commonly prepared
on ordinary typewriters. Note that all techniques depend
entirely on fairly gross properties of the images, rather than
any kind of optical character recognition.

(1)  Average  bit  density.  The  typical  text  block  of 
densely  printed  typesetting  reaches  a  density  of 
over  20%  black  bits  (but  never  as  much  as  50% 
black).  Any  page  which  contains  a  region  which 
achieves  the  typical  text  density  is  deemed  to 
contain  text. 

(2) Line spacing. The spacing between lines is
remarkably accurately determined by an autocorrelation
computation. The number of black
bits on each horizontal scan line is counted, and
used to make a vector of horizontal density vs.
page position. This is actually done twice, once
for the left half of the page and once for the right
half. The vector is then autocorrelated at each
possible shift relative to itself; the first peak in
the autocorrelation function turns up the line
spacing. The line spacing used in JACS text is
about 0.135 inches (10 points); the line spacing
used in the table of contents is about 0.265
inches (19 points). Any page with the interlinear
spacing characteristic of JACS typesetting
is noted as a text page.

(3) Columnation. JACS is normally printed in double
column. Each page image is examined for
vertical strips of white space extending at least
1/3 of the way up the page (some pages have full
width tables at the top). The number of such
vertical strips is counted and used to decide
how many columns this page used; most of the
supplementary pages are multi-column tables.
Any page with two columns of about the right
width is again marked as a text page.
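
The three tests above can be sketched as follows. This is an
illustration under stated assumptions, not the authors' code: a
page is taken to be a list of rows of 0/1 bits (1 = black), the
block size and all function names are hypothetical, and the line
spacing is computed on the whole page rather than separately for
the left and right halves.

```python
def has_text_density(page, block=64, low=0.20, high=0.50):
    """True if some block has more than `low` but less than `high` black bits."""
    height, width = len(page), len(page[0])
    for top in range(0, height, block):
        for left in range(0, width, block):
            bits = [bit for row in page[top:top + block]
                    for bit in row[left:left + block]]
            if low < sum(bits) / len(bits) < high:
                return True
    return False


def line_spacing(page):
    """Line pitch in scan lines: first peak of the row-profile autocorrelation."""
    profile = [sum(row) for row in page]           # black bits per scan line
    mean = sum(profile) / len(profile)
    centred = [count - mean for count in profile]
    max_shift = len(centred) // 2
    corr = [sum(centred[i] * centred[i + k] for i in range(len(centred) - k))
            for k in range(max_shift + 1)]
    for k in range(1, max_shift):
        if corr[k] > 0 and corr[k] > corr[k - 1] and corr[k] >= corr[k + 1]:
            return k
    return None                                    # no periodic structure found


def count_columns(page, min_fraction=1 / 3):
    """Count columns separated by white strips rising from the page bottom."""
    height, width = len(page), len(page[0])
    first_row = height - max(1, int(height * min_fraction))
    white = [all(page[y][x] == 0 for y in range(first_row, height))
             for x in range(width)]
    strips, x = 0, 0
    while x < width:
        if white[x]:
            start = x
            while x < width and white[x]:
                x += 1
            if start > 0 and x < width:            # ignore the outer margins
                strips += 1
        else:
            x += 1
    return strips + 1
```

Dividing the value returned by `line_spacing` by the scanning
resolution gives the spacing in inches; at 300 dpi a spacing of
0.135 inches corresponds to roughly 40 scan lines.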

It would seem that with three different ways of marking pages
as text pages, this algorithm would be fail-safe: that is, it
would be more likely to mark supplementary pages as printed
pages than the other way around. Unfortunately the errors are
about balanced (although rare: some dozens of failures in 7000
images). The reason is that some printed pages are filled or
almost filled with one large table or figure, and thus do not
have the characteristics of a densely printed text page. The
mistakes made in the classification are caught as the page
numbers are matched up with the images.

The next step is reassembling certain page images which have
been split in the microfilming. When an article ends in the
middle of a page, and has supplementary pages after it, there
is a microfilm image for the portion of the page which contains
the end of the first article, followed by images for the
supplementary pages, followed by a microfilm image for the
portion of the page which contains the beginning of the next
article. To obtain the appearance of the journal as printed, we
logically OR together the bits from the two text halves, having
deleted the supplementary images in the previous step.
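
The reassembly amounts to a bitwise OR of two equally sized
bitmaps. A minimal sketch, assuming each partial image is a
list of rows of 0/1 bits already registered to the same page
frame (the function name is illustrative):

```python
def merge_split_page(first_part, second_part):
    """OR together two partial scans of the same printed page."""
    if len(first_part) != len(second_part):
        raise ValueError("partial images differ in height")
    merged = []
    for row_a, row_b in zip(first_part, second_part):
        if len(row_a) != len(row_b):
            raise ValueError("partial images differ in width")
        # A bit is black in the result if it is black in either scan.
        merged.append([a | b for a, b in zip(row_a, row_b)])
    return merged
```

Because each partial image is white where the other carries
text, the OR reproduces the page as printed.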

Finally, some hand examination is required per journal issue.
Not only do we spot check the previous steps, but each issue
contains a few pages of prefatory material (e.g. instructions
to authors) which may look like text but are not part of the
paginated sequence of the main journal.

We now must match up the images of the articles with the text
of the articles. The text tapes contain authors, titles, and so
on but not page numbers. However, the Chemical Abstracts
Service tapes do contain the page numbers, and can be used to
make the necessary table of article titles vs. page number. For
issues which, for administrative reasons, we are processing
before the Chemical Abstracts tapes have arrived, it is not
difficult to type in the list of page numbers for each article.

At  this  stage  we  have  data  files  which  contain  the  full 
text  of  each  article,  including  figure  captions  and 
tables,  in  ASCII,  from  the  American  Chemical  Society 
data.  We  also  have  files  containing  the  abstract,  index 
terms,  and  bibliographic  citation  from  the  Chemical 
Abstracts  data.  The  full  article  is  also  available  as 
page  images. 

The next step is to identify the specific figures which
correspond to each figure referenced in the text.

There are four kinds of graphics which we identify (tables,
figures, equations and 'schemes', i.e. chemical structures)
using techniques similar to those above for classifying pages.
These will be discussed in detail in a later paper; for
previous work on this subject see Fletcher and Kasturi (1987).
For each picture or table, the typesetting tapes specify the
size, but not the position on the page. However, the sequence
of figures within the article is given, and this is used to
match up the recognized images with the text references.
Similarly, the footnotes and the ASCII forms of the tables are
matched up with the places where they are referenced (this, of
course, does not involve any image processing).
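
Matching by sequence can be sketched as below. The data shapes
are assumptions made for illustration: references come from the
typesetting data as (label, kind) pairs in article order, and
recognized regions come from the image analysis as (kind,
bounding box) pairs in reading order. The check against the
sizes given on the typesetting tapes is omitted.

```python
from collections import defaultdict, deque


def match_by_sequence(references, regions):
    """Pair the n-th reference of each kind with the n-th region of that kind."""
    queues = defaultdict(deque)
    for kind, box in regions:
        queues[kind].append(box)
    matches = {}
    for label, kind in references:
        # None marks a reference for which no region of that kind remains.
        matches[label] = queues[kind].popleft() if queues[kind] else None
    return matches
```

A reference mapped to `None` signals a page that needs hand
examination.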

3.  Presentation. 

There are several different ways in which the users wish to
access information in these journals. We distinguish at the
moment three major forms of access which require quite
different facilities:

(1)  Reading  a  particular  article  whose  citation  they 
have. 

(2)  Looking  for  information  on  a  particular  subject. 

(3) Skimming through issues for current awareness
or general intellectual curiosity.

In order to make the machine-readable form acceptable, we have
to serve each of the information needs above. It is not clear
that a single computer interface can provide for all of these
purposes. Note that for purpose (2), searching for a particular
subject, the printed journal needs to be supplemented also, in
this case with the printed version of Chemical Abstracts.

In addition to questions of presentation and style of
interfaces, we must also decide what kinds of information are
presented to the user and in what format. The biggest such
question is how to use the Chemical Abstracts indexing
effectively, since it contains both textual structure and term
normalization, facilities that users could exploit to improve
their searches but which many users may not understand well
enough to use without some aids.

Interviews  with  the  users  have  found  that  skimming  is 
a  major  part  of  their  use.  In  this  mode,  they  flip 
through  rapidly,  and  claim  that  they  rely  mainly  on  the 
pictures.  Chemists,  particularly  organic  chemists,  are 
very  skilled  at  interpreting  chemical  structure 
diagrams,  and  visualizing  three-dimensional  structures 
of  the  compounds  they  study.  Such  visualization  is 
very  important  in  their  work,  whether  it  be  synthesis  of 
molecules  or  analysis  of  their  properties.  Thus,  rapid 
display  of  properly  drawn  pictures  is  important;  nei¬ 
ther  systematic  nomenclature  nor  line  notations  nor 
bad  approximations  of  the  actual  structures  will 
suffice. 

For  these  reasons,  the  American  Chemical  Society 
suggested  that  we  keep  and  display  the  page  images  as 
well  as  the  text  plus  pictures  separately.  This  will  per¬ 
mit  us  to  make  some  kind  of  measurement  of  the  value 
of the original typography vs. a reformatted text. We
support at present three different interfaces for detailed
physical presentation of the material and expect to see
which of these the chemists prefer. Each one deals in
a different way with the inability of current computer
workstations, even large-screen workstations, to equal
the resolution of printed paper. Typical screen resolutions
are 72 or at most 100 dpi, with a maximum of perhaps
1000 vertical pixels down a page; printing is at least
1000 pixels per inch, and even conventional laser
printers and copiers have several times the accuracy
of the typical screen.
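
The size of this gap can be made concrete with a little arithmetic. The sketch below assumes a page of 8.5 x 11 inches; the page size is an assumption for illustration and is not stated in the text.

```python
# Assumed page size in inches (US letter); not taken from the paper.
PAGE_WIDTH_IN = 8.5
PAGE_HEIGHT_IN = 11.0


def pixels_for_page(dpi, width_in=PAGE_WIDTH_IN, height_in=PAGE_HEIGHT_IN):
    """Return (width, height) in pixels of a full page rendered at `dpi`."""
    return round(width_in * dpi), round(height_in * dpi)


def fits_on_screen(dpi, screen_px=(1000, 1000)):
    """True if a full page at `dpi` fits inside a screen of `screen_px`."""
    width, height = pixels_for_page(dpi)
    return width <= screen_px[0] and height <= screen_px[1]
```

Under this assumption a page needs 850 x 1100 pixels at 100 dpi, slightly more than a display with 1000 vertical pixels can show, and 1700 x 2200 pixels at 200 dpi.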

These different interfaces will include Superbook, the
OCLC Diadem system, and a simplistic system called
Pixlook. The Diadem system will be described in a
later publication; the other two are presented here.

(1)  Superbook,  by  Egan,  Gomez  and  Remde  (1989), 
displays  Ascii  text  with  graphics  on  demand. 

The advantages of displaying Ascii are that the
display is better matched to the capabilities of
the computer terminal, that it is possible to reformat
the text to match the window size chosen by the
user, and that the displayed text can be matched
to the user's query. For example, after a word
or phrase search, the text display begins with the
start of the paragraph in which the user's search
terms appear. The terms are highlighted. In
addition, another window displays the table of
contents of the material so that the user can
locate the current page in the context of the
overall publication. Searching in Superbook is
based on co-occurrence of terms within paragraphs;
a variety of aids such as term truncation
and aliasing are provided as well.
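
A paragraph-level co-occurrence search with truncation and aliasing might look like the sketch below. It is only an illustration of the idea, not Superbook's implementation; the alias table, the `*` truncation syntax and the function names are assumptions.

```python
import re

# Hypothetical alias table: a key matches any of the listed phrases.
ALIASES = {"nmr": {"nmr", "nuclear magnetic resonance"}}


def tokenize(text):
    """Lower-case a string and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def term_matches(term, paragraph):
    """True if `term` occurs in `paragraph`.

    A trailing '*' requests truncation (prefix match); a term listed in
    ALIASES also matches any of its alias phrases.
    """
    words = tokenize(paragraph)
    if term.endswith("*"):
        prefix = term[:-1].lower()
        return any(w.startswith(prefix) for w in words)
    normalized = " ".join(words)
    variants = ALIASES.get(term.lower(), {term.lower()})
    return any(re.search(rf"\b{re.escape(v)}\b", normalized)
               for v in variants)


def search(paragraphs, terms):
    """Return indexes of the paragraphs in which all `terms` co-occur."""
    return [i for i, p in enumerate(paragraphs)
            if all(term_matches(t, p) for t in terms)]
```

The display would then open at the start of each paragraph returned by `search`, with the matching terms highlighted.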

(2) Pixlook displays images of the original pages. It
performs coordinate index or Boolean searches,
is able to search specific document fields,
and also supports truncation and suffixing. For
each search, the matched list of titles is
presented and the user can choose which articles
to read. Each article is presented first as a
display of an image at 100 dpi resolution, 1 bit
per pixel, from the original page; this means that
nearly the entire page can fit in a window on a
workstation with a 1000x1000 display. The user
can then zoom in on a portion of the image to
200 dpi resolution, which is normally adequate
for reading (100 dpi is quite adequate for a quick
view, and can be read, but is not satisfactory
visual quality for most readers for the long
term). The user can move around in the image
or move backwards and forwards in the text. It
is also possible to browse the pages of tables of
contents as images and not use the search
features at all.
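
The two search styles can be contrasted in a short sketch. Here "coordinate index" search is taken to mean ranking records by the number of query terms they match, which is an assumption about the intended meaning; the record layout and function names are likewise hypothetical.

```python
def has_term(record, field, term):
    """True if `term` matches a word in `record[field]`.

    A trailing '*' on the term requests truncation (prefix match).
    """
    words = record.get(field, "").lower().split()
    if term.endswith("*"):
        return any(w.startswith(term[:-1].lower()) for w in words)
    return term.lower() in words


def boolean_and(records, query):
    """Return titles of the records matching every (field, term) pair."""
    return [r["title"] for r in records
            if all(has_term(r, f, t) for f, t in query)]


def coordinate_search(records, query):
    """Rank records by how many (field, term) pairs they match.

    Records matching no pair are dropped; ties keep the input order.
    """
    scored = [(sum(has_term(r, f, t) for f, t in query), r["title"])
              for r in records]
    scored = [(n, title) for n, title in scored if n > 0]
    return [title for n, title in sorted(scored, key=lambda s: -s[0])]
```

The Boolean form returns a shorter, stricter list; the ranked form returns partial matches as well, which is where fast review of page images becomes useful.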

A  variant  on  this  program,  for  those  who  wish  to 
see  images  primarily,  displays  only  the  pictures 
from  the  articles  which  match  the  search  terms. 
These  can  be  brought  onto  the  screen  and  exam¬ 
ined  quickly  in  low  resolution.  Then,  the  user 
can  pop  up  the  full  pages,  with  text,  from  the 
articles  whose  pictures  seem  interesting. 

As examples, Figure 1 shows a sample of a Superbook
screen,  while  Figure  2  shows  a  Pixlook  display  at  100 
dpi  and  Figure  3  at  200  dpi.  We  are  planning  to  run 
experiments  on  the  comparative  acceptability  to  the 
users  of  these  different  presentation  formats.  Experi¬ 
ments  with  Superbook  in  the  past  have  shown  superior 
performance  for  searches  aimed  at  specific  target 
information.  Not  only  is  the  searching  efficient,  there 
being  no  feasible  way  to  match  with  paper  the  facili¬ 
ties  of  full-text  electronic  search,  but  the  display  is 
effective  at  calling  the  user’s  attention  to  the  material 
found.  However,  we  have  not  evaluated  Superbook 
formally  in  applications  such  as  skimming,  nor  for  the 
problem  of  known-item  retrieval  in  a  large  document 
collection.  Image-based  displays  may  well  be  superior 
in  these  applications  since  they  retain  the  format  the 
users  find  familiar  and  which  has  been  tailored  over 
the  years  for  effective  use  by  chemists. 

4.  Searching. 

Not  only  do  the  different  systems  we  have  imple¬ 
mented  include  different  searching  methodologies,  but 
the  user  will  have  a  variety  of  sources  of  information 
to  search.  There  are  of  course  the  usual  facilities  of 
title, author, abstract and the full text terms. In addition,
Chemical Abstracts defines for each article a very
precise set of phrases, for example phenyloxirane,
prepn. of, by alkene epoxidn. in presence of nickel
catalysts, which are printed in their subject index (in
this case alphabetically under Oxirane). These entries
are defined for the purpose of display in an index people
are browsing, whether on paper or electronically.
Thus, we can provide the user a choice of whether to
search for free text, selected index terms, or author
names or other entry points; and also of several different
interfaces and ways of specifying the searches.

The obvious danger, particularly in such a large collection,
is that people will be drowned in retrieved items.
Whether this is based upon text retrieval (words like
‘NMR’ or ‘bond’ appear more than once per article, on
average), or upon figures (there are an average of six
graphic items per article), there is a danger of overloading
the user. For different problems, there will be
different strategies of dealing with the amount of
information available. Some of these strategies depend
upon the user and some upon the computer.

In  the  Pixlook  program,  for  example,  there  are  basi¬ 
cally  only  conventional  facilities  for  narrowing 
searches.  The  intent  is  to  provide  very  fast  pop-up  of 
matched  pages  and  rely  on  the  user  to  recognize  what 
he  is  interested  in.  In  the  right  circumstances  this  can 
be very fast (for example, in the process of checking
the page images to see whether the pictures had been
correctly noted, 50 dpi images were used, and although
these cannot be read, they can be reviewed at better
than one per second). We would like to compare the
attractiveness  of  fast  review  of  images  to  adding 
Boolean  ands  to  queries  as  a  way  of  dealing  with  a 
large  number  of  retrieved  documents.  Note  that  in  this 
interface  the  user  must  type  the  search  terms  and  there 
is  no  facility  to  help  suggest  any. 

By contrast, Superbook provides a variety of facilities
for  structuring  the  retrieval  to  give  the  user  a  view  of 
what  has  come  back.  The  overall  list  of  articles  comes 
with  indications  of  which  sections  of  the  journal  con¬ 
tain  how  many  instances  of  each  term,  and  thus  per¬ 
mits  the  user  to  feel  in  some  way  oriented.  As  the  user 
expands  the  table  of  contents,  more  detailed  indica¬ 
tions  of  how  many  terms  occurred  in  which  sections 
are  shown.  Thus,  the  user  can  maintain  some  context 
and  feeling  for  how  the  retrieved  passages  are  distri¬ 
buted  around  the  documents.  This  works  best,  of 
course,  in  a  collection  which  contains  one  continuous 
and  organized  document,  but  still  has  value  in  the  con¬ 
text  of  a  collection  of  many  articles,  for  each  has  sub¬ 
sections.  Superbook  also  makes  it  easy  to  select  terms 
from  existing  material  to  be  searched  again,  using  the 
mouse;  the  user  can  thus  not  only  be  prompted  by 
looking  at  the  document  but  also  need  not  worry  about 
spelling  the  terms  correctly. 
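
The per-section hit counts described above can be sketched as a simple aggregation over the table-of-contents hierarchy. The data layout (a path tuple for each passage) and the function name are assumptions made for illustration; this is not Superbook's code.

```python
from collections import Counter


def hits_by_section(passages, term, depth):
    """Count occurrences of `term` per table-of-contents entry.

    `passages` is a list of (path, text) pairs, where `path` is a tuple such
    as ("Journal A", "Article 1", "Experimental"). `depth` is the number of
    table-of-contents levels currently expanded.
    """
    counts = Counter()
    for path, text in passages:
        n = text.lower().split().count(term.lower())
        if n:
            counts[path[:depth]] += n
    return dict(counts)
```

Raising `depth` corresponds to the user expanding the table of contents: the same hits are redistributed over more detailed entries.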

We  have  also  briefly  investigated  clustered  methods  of 
searching.  The  original  intent  was  to  see  if  the  articles 
could  be  grouped,  instead  of  by  issue,  by  computing 
term  overlaps  and  then  running  hierarchical  clustering 
algorithms.  However,  these  clusters  did  not  seem  con¬ 
sistently  intuitively  sensible  to  some  chemists  who 
considered  them,  and  so  we  are  going  to  use  the  sec¬ 
tions  of  Chemical  Abstracts  into  which  the  articles  are 
classified. This divides the collection into fewer parts
than  would  be  possible  with  automatic  clustering,  but 
the  categories  make  sense  to  chemists  and  are  rela¬ 
tively  familiar. 
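
For reference, the kind of clustering that was tried can be sketched as follows. The text does not say which overlap measure or linkage rule was used, so the Jaccard coefficient and single-link merging here are assumptions, as are the function names.

```python
def jaccard(a, b):
    """Term overlap between two sets of terms (0.0 to 1.0)."""
    return len(a & b) / len(a | b) if a | b else 0.0


def cluster(articles, threshold):
    """Agglomerative (single-link) clustering of articles by term overlap.

    `articles` maps an article id to its set of terms. Clusters are merged,
    most similar pair first, until no pair of clusters has a similarity of
    at least `threshold`. Returns a list of sorted lists of article ids.
    """
    clusters = [[name] for name in sorted(articles)]

    def similarity(c1, c2):
        # Single link: similarity of the closest pair of members.
        return max(jaccard(articles[x], articles[y]) for x in c1 for y in c2)

    while len(clusters) > 1:
        best, i, j = max(
            (similarity(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if best < threshold:
            break
        clusters[i] = sorted(clusters[i] + clusters[j])
        del clusters[j]
    return clusters
```

Lowering `threshold` yields fewer, larger groups; nothing in such a procedure guarantees that the groups correspond to categories a chemist would recognize.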

A  more  interesting  question  is  whether  any  clustering 
can  be  done  of  the  graphical  figures.  The  importance 
of  diagrams  to  the  chemists  would  make  it  attractive  to 
have  some  technique  for  grouping  the  figures.  Since 
we  only  have  bitmaps,  it  would  be  difficult  to  do  this 
directly,  but  Tom  Landauer  at  Bellcore  has  suggested 
that  we  could  classify  the  figures  on  the  basis  of  the 
text of their captions. This is not a complete answer;
many of the structural diagrams are printed as
‘schemes’ and do not have captions, for example. But
it would be very useful to be able to search for figures
which are similar to other figures or to present them in
some content-related way.
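
One simple way to compare figures through their captions is a bag-of-words cosine similarity, sketched below. The choice of measure and the names used are assumptions; the suggestion in the text does not specify a method.

```python
import math
import re
from collections import Counter


def caption_vector(caption):
    """Bag-of-words vector (word -> count) for a caption."""
    return Counter(re.findall(r"[a-z]+", caption.lower()))


def cosine(u, v):
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def most_similar(figure_id, captions):
    """Return the id of the figure whose caption is closest to `figure_id`.

    `captions` maps figure ids to caption text; figures without a caption
    (for example uncaptioned schemes) map to "" and are never returned.
    Returns None when no other captioned figure shares a word.
    """
    target = caption_vector(captions[figure_id])
    best_id, best_score = None, 0.0
    for other, text in captions.items():
        if other == figure_id or not text:
            continue
        score = cosine(target, caption_vector(text))
        if score > best_score:
            best_id, best_score = other, score
    return best_id
```

As the text notes, uncaptioned schemes fall outside any such method; in this sketch they are simply skipped.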


5.  Conclusion. 

Many  people  insist  that  they  will  never  abandon  paper 
for  any  kind  of  screen  display.  They  point  to  the 
amount  of  reading  they  do  removed  from  workstations 
and  networks  (on  airplanes,  in  traffic  jams,  even  on 
canoe  trips).  Electronic  information  delivery,  how¬ 
ever,  makes  possible  searching  and  browsing  in  ways 
that  potentially  offer  greatly  improved  performance,  if 
we  can  manage  to  deliver  this  in  a  way  the  users  find 
appropriate,  acceptable  and  effective. 

Our system supports a variety of approaches to retrieving
both text and images for use in chemical information
delivery. Preliminary interviews with chemists
and experience with these systems suggest that effective
presentation of graphics is very important for user
satisfaction with the system. We are planning a series
of experiments to measure the utility of

•  a variety of kinds of information to search,
including free text, fielded data, and indexing;

•  a variety of search strategies, including browsing
indexes, Boolean and coordinate indexing
search, and other techniques;

•  a variety of display strategies, including images
of original pages and resynthesized text.

We expect to learn what kind of computer facilities a
system  must  include  in  order  to  cope  with  the  wide 
variety  of  user  information  needs  that  chemical  infor¬ 
mation  satisfies  today,  and  if  possible  to  improve  our 
ability  to  respond  to  these  needs  in  a  way  the  users 
like. 
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Note 

In  response  to  requests,  some  of  the  viewgraphs  used 
during  the  presentation  are  included  here  (pages  5-8 
to  5-13). 


Figure 1. Superbook example

[Screen image of a Superbook display showing a passage of
article text and its footnotes; the contents of the image
are not reproduced here.]
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Why  not? 

“Libraries  for  books  will  have  ceased  to  exist 
in  the  more  advanced  countries  except  for  a  few 
which  will  be  preserved  at  museums.” 

-  Arthur  Samuel  (IBM) 


prediction  of  1984,  made  in  1964 


Remember  microfilm? 

Don’t  these  quotes  seem  familiar? 

Microfilm  promises  “to  have  an  impact  on 
the  intellectual  world  comparable  with  that 
of  the  invention  of  printing”  -  1936. 

Microphotography  is  "one  of  the  most 
important  developments  in  the  transmission 
of  the  printed  word  since  Gutenberg"  -  1940. 

Not only did hypertext not invent
text, it didn't even invent hype.

Conclusion: 

Just  fast  page  turning  isn’t  enough: 
need  searching  to  make  a  difference. 


The  Problem 

People  search  online,  but  they  don't  read  online 
Why? 

-  Costs  too  much 

-  Bad  interfaces 

-  No  graphics 

Needed  for  Solution 

Extraction  of  graphics 
Procedures  for  display 
Different  human  interfaces 


Our  Plan 

Create an automated chemistry library

Text: from the American Chemical Society file

- 10 years, 20 journals, key U.S. publisher

- full text with typographic markup

Pictures: scanned from microfilm

- graphics extracted from page images

- most graphics are line drawings, e.g. chemical
structures, spectra, reaction pathways

- about 20% of journal is pictures (sq. inches)

Interface

- Free text searching or index term searching

- Page image display or Ascii display of text

- Browsing interfaces as well as searching

Experiments

- Electronics vs. paper

- Different electronic modes

- Design of a better system for chemists


Handling  images 

1. Identifying text

(microfilm  contains  40%  supplementary  pages) 

-  overall  density  must  be  20%  black 

-  line  spacing  10  points  (autocorrelation  function) 

-  columnation  (look  for  white  strips) 

2.  Matching  text  to  articles 

-  need  table  of  contents 

3.  Skew  removal 

-  mostly  we  lucked  out 

4.  Identifying  figures,  tables,  schemes  and  equations 
above  parameters  plus 

-  aspect  ratio 

-  horizontal  and  vertical  lines 

-  presence  or  absence  of  captions 

-  white  space  distribution 
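
Three of the page measurements listed under item 1 can be sketched on a 1-bit bitmap held as a list of rows. This is a minimal illustration only: the function names and bitmap representation are assumptions, and the thresholds applied to the results are not shown.

```python
def black_density(bitmap):
    """Fraction of pixels that are black (1) in a list-of-rows bitmap."""
    total = sum(len(row) for row in bitmap)
    return sum(map(sum, bitmap)) / total if total else 0.0


def line_spacing(bitmap, max_lag):
    """Estimate the text line pitch in pixels from the row profile.

    The row profile is the black pixel count of each row; the lag (from 1
    to `max_lag`) that maximizes its autocorrelation is taken as the pitch.
    """
    profile = [sum(row) for row in bitmap]
    mean = sum(profile) / len(profile)
    centered = [p - mean for p in profile]

    def autocorr(lag):
        return sum(a * b for a, b in zip(centered, centered[lag:]))

    return max(range(1, max_lag + 1), key=autocorr)


def white_strips(bitmap):
    """Return (start, end) column ranges that contain no black pixels."""
    width = len(bitmap[0])
    empty = [all(row[x] == 0 for row in bitmap) for x in range(width)]
    strips, start = [], None
    # The trailing False closes a strip that reaches the right edge.
    for x, is_empty in enumerate(empty + [False]):
        if is_empty and start is None:
            start = x
        elif not is_empty and start is not None:
            strips.append((start, x - 1))
            start = None
    return strips
```

A text page would be expected to show a density near the stated figure, a pitch near the expected line spacing, and a white strip between its columns.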


Interfaces 


Superbook 

-  table  of  contents,  fisheye 

-  text  with  marginal  icons 

-  free  text  word  search 

-  popup  graphics 

Alternatives  being  provided 

-  Boolean  search 

-  representative  icons  for  graphics 

-  Page  image  display 

-  Different  resolutions 


Experiments 

Display:  Paper  vs.  Images  vs.  Ascii 

Search:  Boolean  vs.  Coordinate  Index 

Vocabulary:  Free  text  vs.  Indexing 

Tasks:  Known  item  vs.  Subject  vs. 
Current  Awareness 


Conclusions 

This  is  harder  than  we  expected! 

Image  technology  permits 

-  domain  independent  view 

-  retention  of  typographic  quality 

I  believe  we  will  have  a  system  the 
chemists  will  prefer  to  paper. 


[Viewgraph: three plots, each with the axis label "black
pixel count".]

[Sample page image: page 7162, J. Phys. Chem., vol. 92,
1988, from an article on holographic interferometry of
bilayer lipid membranes (BLMs); the text inside the image
is not reproduced here.]

Summary

Natural language analysis techniques now permit
advanced processing of full texts. In computerized
documentation, they thus find new applications in
indexing, querying and retrieval, and finally in
bibliometric processing. In this article we present
real applications, in a microcomputer environment,
on the themes of indexing assistance and
bibliometrics.

On the one hand, when a document is indexed in a
document database, a quick analysis of an abstract
or an introduction makes it possible to associate
with the text a set of descriptors belonging to a
thesaurus. A somewhat finer analysis allows the
extraction of descriptors that are free but
characteristic of the document from a linguistic
point of view.

On the other hand, all bibliometric processing
rests on a statistical analysis of a large volume of
document records. One contribution of language
analysis techniques is to bring out free descriptors
(not defined a priori) that are statistically and
linguistically interesting. These descriptors
supplement the keywords of the document records
and thus offer a better basis for the usual
bibliometric processing.


Introduction 

The computerized documentation chain comprises
several tasks, the most important of which are
indexing, retrieval and querying. Bibliometric and
infometric processing appeared more recently, in
order to monitor document databases rather than
merely exploit them.

With the growing power of microprocessors and the
theoretical progress made in computational
linguistics during the 1970s and 1980s, natural
language analysis techniques are beginning to
provide useful aids suited to computerized
documentation.

Since 1987, the approach followed at SELISA has
been to develop support tools for bibliometrics and
indexing. The applications designed in this way
rest on a French-language processing module,
which has been operational since September 1989
on PC-compatible microcomputers.

This article first describes the linguistic approach
adopted.

The practical adaptation of this approach to
bibliometrics and indexing assistance is then
detailed. Limitations are necessary to ensure, on
the one hand, a certain generality of the tool and,
on the other, reasonable processing times.

Finally, two applications built in a microcomputer
environment on these two themes are presented.
These applications, limited to the processing of
French, are currently in the integration and test
phase for the TELEDOC document database, which
gathers technical publications in the field of
telecommunications.


1. The linguistic approach adopted

Language processing requires distinguishing
between several levels of complexity in language:
word, sentence, discourse.

Word-level processing consists of recognizing and
checking all the derived forms of a word (feminine,
plural, conjugation, ...). This is currently a task
that is mastered in theory and requires only the
definition of lexicons. As an example, a relatively
complete lexicon of French contains about 70,000
words (excluding derived forms).
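
Lexicon-driven recognition of derived forms can be illustrated with a toy example. The lexicon and suffix rules below are invented for this sketch and are far smaller and cruder than a real French morphology; they are not taken from the system described here.

```python
# Tiny illustrative lexicon of base forms (hypothetical); the lexicon
# mentioned in the text holds tens of thousands of entries.
LEXICON = {"connexion", "signalisation", "interface", "bande",
           "transmettre", "niveau"}

# A few hypothetical suffix rules: (derived ending, base ending).
SUFFIX_RULES = [("eaux", "eau"), ("s", ""), ("ent", "re")]


def base_form(word):
    """Return the lexicon base form of `word`, or None if unknown."""
    word = word.lower()
    if word in LEXICON:
        return word
    for derived, base in SUFFIX_RULES:
        if word.endswith(derived):
            candidate = word[: len(word) - len(derived)] + base
            if candidate in LEXICON:
                return candidate
    return None
```

The lexicon does the real work: a rule only proposes a candidate, and the candidate is accepted only if it is a known base form.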

Sentence-level processing is much more complex
and is divided between syntactic and semantic
analysis. Although neither of these components
has yet received a complete solution, several
theoretical approaches yield usable results.

On the one hand, purely syntactic approaches,
which derive mainly from the work of N. Chomsky
(1), have the advantage of using only grammatical
information. They do not, however, handle
synonymy and closely related meanings effectively
(e.g. "vouloir manger", to want to eat, is
equivalent to "avoir faim", to be hungry).
On the other hand, semantic approaches aim to
refine the syntactic analysis of a sentence by using
the meaning of the words that compose it. This
should make it possible to resolve ambiguities and
to associate expressions with closely related
meanings. However, these semantic approaches
require one or more definitions to be associated
with each term of a lexicon. The complexity and
volume of this task are such that it is unrealistic to
hope to build a large, usable semantic lexicon.

For more information on the various approaches to
sentence analysis, the reader may refer to (2).

Discourse analysis is still at such a level of
complexity that most work on the subject is carried
out in research laboratories. The current state of
research is described in (3).

The approach adopted at SELISA was to develop a
sentence analyzer combining syntax and semantics.
This program, named LangNat, uses semantic
information but also takes into account terms
without a definition. It also has a grammatical
lexicon of 77,000 words (excluding derived forms)
of general French.


2. Adaptation to computerized documentation

In computerized documentation, we are concerned
with the problem of identifying descriptors from
full text, which bears on both indexing and
bibliometrics. Since, in both cases, the volumes of
text to be processed are relatively large, it is
essential to rely mainly on simple, fast processing,
with only limited recourse to complex, lengthy
processing.

With this in mind, we chose a two-level program:

a) Morpho-lexical processing of words makes it
possible to associate a set of descriptors very
quickly with a document or with a set of
document records. Moreover, some of this
processing is completely independent of the
document database considered.

b) Syntactic-semantic processing refines the
results but imposes some constraints:
processing time about 10 times higher, and
the need to know a large part of the
vocabulary semantically. This means that
adapting the tool to a new document
database can take a long time.

To obtain acceptable response times (in a
microcomputer environment), the two modules
cooperate so as to limit syntactic-semantic
processing to the analysis of potential descriptors.

In this context, the first module processes the
whole of a document to be indexed, or of
downloaded document abstracts, and proposes a
list of morpho-lexical descriptors. These are groups
of words, in base form rather than derived forms,
such as "réseau .. connexion" (network ..
connection) or "transmission large bande"
(broadband transmission). This morpho-lexical
approach naturally offers a certain precision in
identifying the vocabulary and hence the key
expressions. It cannot, however, reveal synonymy
between words or expressions. It cannot
automatically relate texts such as:

- Commande du réseau de signalisation
(control of the signalling network)
- Interface d'adaptation de signalisation
(signalling adaptation interface)

To establish this kind of comparison, the
morpho-lexical system would have to allow the
predefinition of a set of expressions far too large to
be readily usable.

For each of these potential descriptors, the second
module then performs syntactic-semantic
processing of all the sentences containing it. The
aim of this module is to build a representation of
the syntax and the semantics (in other words, of
the structure and the meaning) that allows the text
to be encoded in a way that is relatively
independent of the precise vocabulary used.

Thus the two expressions

- Commande du réseau de signalisation
(control of the signalling network)
- Interface d'adaptation de signalisation
(signalling adaptation interface)

have a common part:

- Moyen de modification de signalisation
(means of modifying signalling)

which can be generated automatically. However,
since determining this common part is a relatively
complex and lengthy operation, it should be
limited to statistically important expressions.
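
The cooperation between the two levels can be sketched as a cheap counting pass that selects candidates, followed by a slow pass applied only to the frequent ones. Everything here is an assumption made for illustration: candidates are taken to be pairs of successive content words, the stopword list is invented, and `deep_analysis` is a placeholder for the syntactic-semantic analyzer, whose interface is not described in the text.

```python
from collections import Counter

# Hypothetical stopword list for the fast pass.
STOPWORDS = {"de", "du", "des", "la", "le", "les", "d", "l", "un", "une", "et"}


def candidate_descriptors(sentences):
    """Level 1: count pairs of successive content words in each sentence."""
    counts = Counter()
    for sentence in sentences:
        words = [w for w in sentence.lower().replace("'", " ").split()
                 if w not in STOPWORDS]
        counts.update(zip(words, words[1:]))
    return counts


def two_level_analysis(sentences, deep_analysis, min_count=2):
    """Level 2: run `deep_analysis` only where frequent candidates occur.

    `deep_analysis` stands in for the slow analyzer and is supplied by the
    caller. Returns {descriptor: [analysis result for each sentence]}.
    """
    results = {}
    for pair, n in candidate_descriptors(sentences).items():
        if n < min_count:
            continue  # statistically unimportant: skip the slow pass
        first, second = pair
        hits = [s for s in sentences
                if first in s.lower() and second in s.lower()]
        results[" ".join(pair)] = [deep_analysis(s) for s in hits]
    return results
```

With a slow pass roughly ten times more expensive than the fast one, the saving depends directly on how few candidates pass the `min_count` filter.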

In summary, this two-level approach keeps
processing times reasonable while performing a
fine-grained analysis on specifically targeted
points.

3. Application to Bibliometrics

The purpose of bibliometrics is to analyze
document records already stored in a document
database in order to derive trends and statistical
information from them. Within a project led by a
public documentation organization, a bibliometric
analysis is carried out according to the following
protocol:

- Downloading of a limited set of documentary records (about 100) to a microcomputer
  by means of a conventional query.

- Extraction of descriptors by analysing the text fields of these records, in order
  to complete the Mots-clé field.

- Statistical analysis of these records, based on the keywords and the additional
  descriptors.

Natural language processing makes it possible to refine the descriptor extraction step.
Integration tests on this topic are currently being carried out on the TELEDOC database.

Certain characteristics of the downloaded documentary records strongly shape the
linguistic processing. First, since the text fields in the records are written by
analysts rather than by the authors, the writing style is frequently telegraphic,
marked by short, enumerative or verbless sentences. Second, since the records are
downloaded at the end of a consultation session, they deal with a fairly narrow domain.

Finally, since the records are the result of document indexing, the terms that appear
in the Résumé field and belong to a thesaurus are generally included in the Mots-clé
field.

From these characteristics, two rules of conduct can be drawn:

- It is pointless to use a thesaurus of the technical domain to search for keywords.

- Combining a morpho-lexical analysis with statistical processing should lead to
  relevant results. Calls to the syntactic-semantic analyser can therefore be kept
  very limited.

Thus, for a set of 100 documentary abstracts extracted from the TELEDOC database on the
topic of data switching and satellite transmission, the program proposes a set of about
a hundred descriptors, beginning with:

fibre optique                 51
réseau local                  47

système télécommunication     14
modulation déplacement        13
transmission numérique        12
bande base                    11
service offrir                10
signal bruit                  10
réseau numérique               9
réseau connexion               9
système transmission           9
liaison montante               9
modulation MDP                 9
réseau commutation             8
liaison satellite              7
transmission donnée            7
numérique satellite            7
réseau commuté                 7


Various strategies are being tested at SELISA for choosing which descriptors to refine
through a systematic analysis of all the sentences containing them. The three most
promising are the following:

- Descriptors occurring more than a certain number of times (between 5 and 10) may be
  analysed.

- Descriptors that already belong to the thesaurus need no further analysis.

- Descriptors with a simple grammatical structure (NOUN + ADJECTIVE) need no further
  analysis.
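
The three strategies can be read as a simple filter over the candidate descriptors. The
sketch below is only illustrative: the function name, the threshold of 5, the tag
strings and the sample thesaurus content are assumptions, not details of the SELISA
software.

```python
# Hypothetical sketch of the three selection strategies; not the SELISA code.

def needs_refinement(descriptor, count, thesaurus, pos_tags, min_count=5):
    """Return True if a candidate should go to the syntactic-semantic analyser.

    descriptor -- candidate expression, e.g. "numérique satellite"
    count      -- number of occurrences in the downloaded records
    thesaurus  -- set of expressions already in the thesaurus (assumed given)
    pos_tags   -- part-of-speech tags of the words (assumed tag names)
    min_count  -- frequency threshold; the text suggests between 5 and 10
    """
    if count <= min_count:                       # strategy 1: not frequent enough
        return False
    if descriptor in thesaurus:                  # strategy 2: already a thesaurus term
        return False
    if tuple(pos_tags) == ("NOM", "ADJECTIF"):   # strategy 3: simple NOUN + ADJECTIVE
        return False
    return True


thesaurus = {"fibre optique"}                    # illustrative content only
candidates = [
    ("fibre optique", 51, ("NOM", "ADJECTIF")),
    ("réseau local", 47, ("NOM", "ADJECTIF")),
    ("numérique satellite", 7, ("ADJECTIF", "NOM")),
]
selected = [d for d, n, tags in candidates
            if needs_refinement(d, n, thesaurus, tags)]
print(selected)
```

With these sample values only "numérique satellite" is passed on for further analysis.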

Given these strategies, analysing all the sentences that contain the selected
descriptors leads to a new set of descriptors whose constituent words may be different.
These new descriptors reflect a syntactic-semantic structure, and no longer a mere
proximity of words within a sentence.

In the preceding example, this results in the creation of the expressions:

transmission numérique par satellite
réseau commuté

in place of

numérique satellite
réseau commutation
réseau commuté

In conclusion, the syntactic-semantic processing module modifies, prunes and enriches
the list of descriptors found in the first analysis. These descriptors are then
assigned to the various documentary records. To do so, a file, initially a copy of the
downloaded file, is enriched with a DESCRIPTEUR field, which is added at the end of
each record.
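
As a rough illustration of this last step, the sketch below copies a set of records and
appends a DESCRIPTEUR field to each. The record layout (a dictionary per record) and the
matching rule (a descriptor is assigned when all its words occur in the Résumé field)
are assumptions made for the example; the source does not specify either.

```python
import copy

def matching_descriptors(record, descriptors):
    """Descriptors whose words all occur in the record's Résumé field (naive rule)."""
    words = set(record.get("Résumé", "").lower().split())
    return [d for d in descriptors if set(d.lower().split()) <= words]

def add_descriptor_field(records, descriptors):
    """Return a copy of the records with a DESCRIPTEUR field added at the end."""
    completed = copy.deepcopy(records)        # the downloaded file stays untouched
    for record in completed:
        # dicts preserve insertion order, so the new field comes last
        record["DESCRIPTEUR"] = matching_descriptors(record, descriptors)
    return completed

downloaded = [
    {"Résumé": "Transmission numérique par satellite sur réseau commuté",
     "Mots-clé": ["satellite"]},
]
descriptors = ["transmission numérique par satellite", "fibre optique"]
completed = add_descriptor_field(downloaded, descriptors)
print(completed[0]["DESCRIPTEUR"])
```

Because the original records are left unchanged, statistics can be run on either
version of the file.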

This approach makes it possible to insert this descriptor search module simply, and
optionally, into a bibliometric processing chain. The statistical analyses can then be
applied equally well to the downloaded file or to the completed file.

4. Indexing aid

Indexing a text in a document database involves defining keywords that are
representative of the text. At present this task is carried out by professional
analysts, largely by hand. Since no one possesses universal knowledge, many technical
documents remain obscure to the analysts and are therefore poorly indexed. Moreover,
for databases associated with thesauri, some descriptors belonging to the thesaurus are
omitted during indexing (a thesaurus may contain up to 100,000 terms).

Consequently, a tool based on analysis of the full text would greatly simplify the
analysts' task.

Unlike documentary records, the texts concerned by indexing aid (technical articles,
...) are structured and written in a relatively elaborate style. In addition, these
texts may touch on any topic covered by the document database into which an analyst is
seeking to integrate them.

These two characteristics, the opposite of those encountered in bibliometrics, impose a
new line of conduct for indexing-aid software:

- The first task is to search the document for descriptors belonging to a thesaurus.

- After this step, a morpho-lexical analysis and statistical processing bring out
  potential descriptors, which are then validated by a syntactic-semantic analysis. If
  the search for keywords in the thesaurus is satisfactory, the analyst may skip this
  second step.

- The free descriptors obtained in the previous step may be added to the thesaurus,
  the decision resting with the analyst.
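
The three steps above can be sketched as follows. This is a simplified illustration
under stated assumptions: thesaurus terms are matched by plain substring search, the
linguistic-statistical analysis is passed in as a placeholder function, and all names
are hypothetical rather than taken from the SELISA software.

```python
# Hypothetical sketch of the thesaurus-first indexing workflow.

def index_document(text, thesaurus, find_free_descriptors=None):
    """Step 1: thesaurus terms found in the text.
    Step 2 (optional): free descriptors proposed by a second analysis."""
    lowered = text.lower()
    controlled = sorted(t for t in thesaurus if t.lower() in lowered)
    free = []
    if find_free_descriptors is not None:      # the analyst may skip this step
        free = [d for d in find_free_descriptors(text) if d not in thesaurus]
    return controlled, free

def enrich(thesaurus, free_descriptors, accept):
    """Step 3: add only the free descriptors the analyst accepts."""
    return thesaurus | {d for d in free_descriptors if accept(d)}

thesaurus = {"fibre optique", "réseau local"}
text = "Liaison par fibre optique et transmission numérique"
# Stand-in for the morpho-lexical, statistical and syntactic-semantic analysis:
second_pass = lambda _text: ["transmission numérique"]

controlled, free = index_document(text, thesaurus, second_pass)
thesaurus = enrich(thesaurus, free, accept=lambda d: True)
print(controlled, free)
```

Passing `accept` as a function keeps the final decision with the analyst, as the text
requires.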

The original feature, compared with the bibliometric application presented in 3., is
therefore the use of a "dynamic" thesaurus as the first level of analysis. The aim is
for this thesaurus to become rich enough to make the second analysis step unnecessary,
or rare. The thesaurus thus built must contain two kinds of information:

- The organisation of the descriptor as a group of words, e.g. "transmission
  numérique satellite".

- The syntactic-semantic structure associated with this descriptor. This structure
  will make it possible to handle synonymy between words or groups of words.

With this in view, SELISA is currently extending the software used in bibliometrics. At
the present stage of development, the step that searches for descriptors belonging to
the thesaurus is complete and interfaced with the bibliometric module. The thesaurus
enrichment step, however, is not yet implemented (completion planned for December 90).
The current version nevertheless shows how complementary the two ways of finding
descriptors are: comparison with a thesaurus on the one hand, and linguistic-statistical
analysis on the other.

Whatever the value of indexing-aid software, it should be borne in mind that its domain
of validity is limited to documents stored on computer in "character" form and not in
"image" form. This means:

- Either processing only documents produced by word processors or computerised
  authoring tools.

- Or setting up a system for capturing the texts to be indexed. The two possible
  solutions are manual entry or a scanner combined with a character recognition
  system. Whichever solution is chosen, this is a relatively heavy operation that is
  complex to manage.

Under these conditions, indexing aid appears to be a service in the making, whose
expansion will depend essentially on that of microcomputing and electronic messaging.


Conclusion

After several decades during which natural language processing was regarded exclusively
as a research topic, practical outlets are emerging, particularly in document
computing. We have presented two applications here, one in the integration and test
phase for bibliometric processing, the other in the development phase for indexing aid.
It should also be noted that an equally important application in document computing
would be natural-language querying of servers or databases. By extension, this topic
also concerns access to non-documentary databases.

However, one of the essential limitations of these applications is the need to use one
language, and not several. Thus the practical applications presented above were
designed for French and cannot be used with Anglo-American document servers.

BEYOND MACHINE TRANSLATION


by  L.  ROLLING 

Computerized  language  processing  unit 
Commission  of  the  European  Communities 
Jean  Monnet  Building  B4/29  •  L-2920  Luxembourg 


The  author  describes  the  essential  aspects  of  computer  translation  :  design,  funding,  development, 
experimental  use,  evaluation  and  implementation.  He  emphasizes  the  technical,  economic  and 
psychological  obstacles  to  be  overcome  for  efficient  use  of  the  new  instrument. 

Current  applications  will  be  supplemented  by  other,  new  applications  in  the  MT  market,  which 
may  well  be  centered  on  commercial  communication  and  data  base  access. 

While the technical feasibility and the economic viability of MT depend on system design and
implementation infrastructure, the quality of MT output ultimately depends on the
representativity of available grammars and dictionaries vis-à-vis the real-life use of human
language.

But  MT  is  not  the  only  application  that  requires  a  complete  mastery  of  a  number  of  languages  by 
computer.  Publishing,  data  base  maintenance  and  interrogation,  and  many  other  natural-language 
processing  activities  require  all-encompassing  text  corpora,  multi-purpose  lexica,  term  banks  and 
other  linguistic  and  computational  resources. 

The EC Commission has a major role to play in the coordination of European efforts to
standardize, develop and manage these resources, to be used also for speech-technology and
knowledge-based applications of the future. EC action includes a number of ESPRIT and
IMPACT projects, but the coordinating and teaching tasks, now under the Commission's
Multilingual Action Plan, may well be handled by a new agency to be created for this purpose.


1.  ESSENTIAL  ASPECTS  OF  MT 

Automatic translation was invented a generation ago, and it has now attained a level of
quality that allows us to say that it is here to stay, in spite of the declarations of pessimists
who say that it is not automatic because the results require some monitoring on the part of
the end user.

If  you  ask  a  user  why  he  uses  MT,  most  of  the  time  he  will  not  mention  the  translation 
quality  or  even  the  reduced  cost,  but  he  will  refer  to  the  increase  in  speed.  What  it  took  a 
human  translator  two  weeks  to  provide,  he  can  now  obtain  within  ten  minutes  after  pushing 
a  button  (or  ten  seconds  if  he  uses  Minitel  for  the  translation  of  a  short  paragraph). 

However, quality, low cost and speed are not enough. What the end user needs is the
push-button device, i.e. the compatible equipment that allows him to input the source text,
be it on an ASCII diskette or on paper, in commercial print format or in low quality
typewriter script, and that allows him to steer the operation towards the desired target
language and format, using the terminology of the relevant subject field, and through the
mainframe and back to his printer.


The  need  for  these  indispensable  tools  was  often  ignored  by  the  supplier,  and  the  user  was 
told  that  his  secretary  had  only  to  type  in  the  source  text,  that  he  had  only  to  sit  at  the  screen 
and  answer  the  system’s  disambiguation  questions,  and  that  he  had  only  to  provide  a  printer 
capable of producing acceptable output.

Today  there  are  quite  a  number  of  MT  suppliers,  and  they  have  become  aware  of  the  need 
not  only  to  supply  the  user  with  an  efficient  software  product,  but  to  help  him  to  install 
compatible  hardware,  to  introduce  his  specific  terminology  into  the  dictionaries,  to  make  his 
work  station  as  user-friendly  as  possible  and  to  make  the  best  use  of  his  feedback  for  system 
improvement. 

The  suppliers  have  realised  that  they  can  only  sell  or  license  their  systems  if  the  first  users 
are  satisfied  and  let  it  be  known.  If  they  want  to  have  a  large  number  of  satisfied  clients,  they 
have  to  take  account  of  a  variety  of  requirements,  concerning  source  and  target  languages, 
specific  terminologies,  and  work  stations  adapted  to  the  users’  computer  infrastructure  and 
global  environment. 

All  this  requires  considerable  investments,  and  the  suppliers  should  see  to  it  that  the 
development,  marketing  and  operating  costs  are  reduced  as  far  as  possible. 

One way to achieve this is through stratified or multi-level dictionaries. Dictionaries can be
subdivided into (dominant) personal or local dictionaries and (default) universal
dictionaries; frequently occurring words can be given precedence over occasional
terminology, etc.
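
A stratified dictionary of this kind can be pictured as a layered lookup in which the
personal or local layer is consulted first and the universal layer serves as the
default. The sketch below uses Python's standard `collections.ChainMap`; the entries and
layer names are invented for illustration and do not come from any actual MT system.

```python
from collections import ChainMap

# Default layer: general-purpose equivalents (illustrative entries only).
universal = {"bank": "banque", "network": "réseau"}

# Dominant layer: a user's own terminology, which overrides the default.
local = {"bank": "rive"}

# ChainMap searches its mappings in order, so the local layer takes precedence.
lexicon = ChainMap(local, universal)

print(lexicon["bank"])      # found in the local layer
print(lexicon["network"])   # falls back to the universal layer
```

Keeping the layers separate means a user's terminology can be added or withdrawn without
touching the shared dictionary.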

It  is  also  possible  today  to  automatically  produce  multilingual  dictionaries  starting  from 
equivalent  texts  in  several  languages  such  as  the  EC  Official  Journal,  the  proceedings  of  the 
European  Parliament  and  Canadian,  Belgian  and  Swiss  legislation. 

(see  Table  1) 


2. EXISTING SYSTEMS


The first MT systems were developed in the sixties on mainframe computers that were less
powerful than today's personal computers. Many have disappeared since, and the only
success stories concern SYSTRAN, LOGOS, METAL and SPANAM. Systran users include
Xerox, the US Air Force, NATO and the EC Commission.


LOGOS  is  used  by  a  number  of  Canadian  organizations  and  SPANAM  is  owned  by  PAHO, 
the  Pan  American  Health  Organization.  METAL,  designed  by  Texas  University,  is  now 
marketed  by  Siemens. 

Quite a number of systems were developed later for the emerging generation of desktop
computers. They are interactive rather than autonomous, i.e. the user has to be present to
help the system solve ambiguities. Other systems such as ALPS and INK are just translators'
aids.


TITUS  and  TAUM  are  specific  in  that  they  require  restricted  syntax  and  terminology. 

Much remains to be said about Japanese research efforts, which have already led to almost a
dozen operational systems for translating from and into Japanese, and about European
research projects, such as the European Community's EUROTRA project and two Dutch
initiatives launched by Philips and BSO.

The most interesting initiative, however, emanates from the USA, where Carnegie-Mellon
University is plunging into the unexplored depths of semantic (as opposed to
morpho-syntactic) analysis.


(see  Table  2) 


3. FUTURE OUTLOOK


Looking  at  today’s  literature  on  the  subject,  one  can  spot  a  number  of  fascinating  ideas. 
Artificial  intelligence,  applied  to  MT,  will  solve  the  semantic  problem  of  ambiguity 
resolution,  as  soon  as  the  necessary  knowledge  bases  come  into  existence.  Neural  networks 
will  allow  for  parallel  and  interlinked  computing,  so  that  the  most  cumbersome  analysis 
systems  will  be  speeded  up  by  several  orders  of  magnitude.  And  speech  technologies  linked 
to  MT  will  develop  a  "telephone  interpreter"  by  the  year  2000. 

How should we take such pronouncements? Well, major progress of this kind never
occurs suddenly, "en bloc".


Many experts will spend many years developing viable knowledge bases, and unfortunately,
machine translation might well not be their major priority. The same is true for neural
networks and parallel computing: once these technologies have been developed for other,
more important projects, MT specialists might start to learn how to apply them to their own
environment.  Voice  technologies  have  a  closer  link  to  MT.  Speech  generation  is  technically 
viable  today,  but  speech  recognition  and  understanding  is  at  least  one  decade  away. 
Fortunately,  it  will  be  possible  to  design  interactive  systems,  allowing  a  speaker  to  monitor 
his  utterances  on  a  screen  before  launching  them  into  the  MT  system. 

Both  voice  analysis  and  synthesis  require  very  large  phonological  dictionaries  and  speech 
data  banks  for  the  development  of  multi-speaker,  speed-  and  noise-independent  systems. 

In  the  meantime,  we  have  to  do  a  lot  of  dirty  work  that  does  not  require  intelligent  research 
capabilities,  but  consistent  efforts  by  competent  linguists  and  programmers. 

Many  systems  that  had  been  developed  under  sophisticated,  academic  programming 
languages  like  LISP  or  PROLOG,  will  have  to  be  converted  to  more  efficient  devices; 
MS-DOS systems will be transferred to UNIX, ALGOL systems to ADA, etc.


Electronic dictionaries developed for publication and/or NLP applications can be made
re-usable for MT systems. Mono- and multilingual text corpuses can be exploited to produce
terminology  and  equivalent  sublanguage  patterns,  making  development  and  use  of  MT 
cheaper  and  cheaper,  while  human  translators,  in  insufficient  numbers  to  perform  the  tasks 
that  are  definitely  not  for  MT  (literature,  publicity,  speeches),  will  increase  their  prices, 
making  MT  even  more  competitive. 


Translations  will  be  stored  in  large  corpuses,  and  retrieval  of  already  translated  text  chunks 
will  become  competitive  as  repetitive  translation  of  everyday  texts  increases  over  the  years. 

Improved  MT  work  stations  will  be  developed  to  give  the  operator  access  to  previous 
translations,  to  lexical  and  terminological  resources  not  yet  available  in  the  MT  dictionaries, 
and  to  allow  him  to  produce  a  variety  of  products  and  services  in  various  languages,  formats 
and  scripts,  so  as  to  satisfy  an  increasing  array  of  impatient  customers. 


(see  Table  3) 


Table 1

ESSENTIAL ASPECTS OF M.T.

1. DESIGN
   - Direct (bilingual)
   - Transfer
   - Interlingua
   - Pivot language

2. ECONOMICS
   - Market (open, hidden)
   - Funding (public, private)
   - Viability threshold
   - Maintenance cost

3. EVALUATION
   - Quality (revision rate)
   - Speed (CPU, turnaround)
   - Cost (raw, post-edited)
   - User-friendliness

4. IMPLEMENTATION
   - Input (OCR, spell check)
   - Text typology (correctness)
   - Computing / Networking
   - Post-editing (format, replace)

5. HUMAN ASPECTS
   - Development
   - Training (authors, users)
   - Acceptance / Promotion
   - Management (feedback)

6. FUTURE PLANNING
   - Resources (corpora, lexica)
   - Automation (extraction, retrieval)
   - Speech input/output
   - YOU NEVER KNOW

Table 2

EXISTING SYSTEMS

                           LOGOS
                           SYSTRAN
                           PAHO
                           METAL

Desktop systems :          SMART
                           WEIDNER-BRAVICE
                           GLOBALINK
                           D'AGOSTINI
                           TOVNA

Task-specific systems :    TITUS
                           TAUM-Meteo

Translators' aids :        ALPS
                           ERICSSON
                           INK

Japanese systems :         ATLAS (Fujitsu)
                           HICATS (Hitachi)
                           MU (Kyoto-JICST)
                           TRANSAC (Toshiba)
                           MELTRAN (Mitsubishi)
                           etc.

Research projects :        EUROTRA (EC)
                           BSO/DLT
                           ROSETTA
                           CARNEGIE-MELLON
                           + Japan + IBM ...

Table 3

Language Engineering Programme

Preparatory period : 1989-91
Programme period : 1992-94


Current  activities 

-  Inventory  of  current  products  and  services,  research  activities  and  projects. 

-  State-of-the-art  studies  for  various  subsectors  of  language  industry. 

-  Economic  impact  studies  for  NLP,  MT  and  speech  technologies. 

-  Definition  of  priorities  for  coordination  :  common  formats  and  standards. 

-  Development  and  coordinated  exploitation  of 

.  representative  text  corpora, 

.  lexical  resources, 

.  terminological  resources, 

.  software  and  hardware  products, 

.  phonological  dictionaries  and  speech  data  banks. 

- Creation of a European Institute for Language Engineering.

THE ROLE OF INTELLIGENT ONLINE INTERFACES TO BRIDGE THE
COMMUNICATION GAP

by

A VICKERY

TOME ASSOCIATES LIMITED
IMO House
222 Northfield Avenue
LONDON, W13 9SJ


ABSTRACT

People are curious by nature, want to know more than they do, and have an insatiable
appetite for information, whether in pursuit of private aims or in their professional
career. The best carrier of information is person-to-person communication. A postgraduate
student will seek information from his tutor or professor; an apprentice from his factory
supervisor; a working engineer from his colleagues. A small proportion of people try to get
information by reading, and an even smaller number will try to access online databases.

And yet a very great deal of technical and professional communication today is mediated
through documents or, even more indirectly, through computer systems. Communication gaps
between man and system are just as real and important as any other, and it is this kind of
gap that will be considered here.

The paper will first enumerate the difficulties which arise in accessing computer-based
information:

1. the language barriers, such as databases in different national languages, concepts
having different meanings in different databases (or parts of the same database), and
variations in command or query languages.

2. the intellectual difficulties, i.e. the gap in knowledge which exists in the searcher's
mind during the stage of search formulation, the misunderstandings which can arise during
human/human communication (if the search is done by an intermediary), and errors arising
from human/computer communication during the search process.

3. the technical barriers in achieving a satisfactory search result, i.e. in communication
with various hosts and many databases, in using telecommunication links, in different
techniques of interrogating files, in different indexing methods, in variations in
structure of vocabularies, and in classification.

The conclusions of the paper will summarise the current achievements in overcoming the
barriers to information for online databases and the problems which still need solutions.

1. INTRODUCTION

The number of databases and electronic information services publicly available in Europe,
the USA and elsewhere is now large (4300 according to Cuadra) and continues to grow. But
actual usage of online databases remains low in comparison with the number of professionals
of all kinds who could benefit from the information that is accessible. Below we shall
consider the steps to be taken by a user to carry out a successful search. In all these
steps one can see barriers which may prevent the user from using online searching.

2. THE ONLINE SEARCH PROCESS

The databases to which electronic information services give access are of various kinds. A
useful categorisation has been provided by Staud (1968), who recognises the following types:

1. The factual databases
      Statistical, with processing facilities
      Quasi-statistical, tables but no statistical processing
      Textual facts, including referral, directories
      Formalisms (e.g. chemical structures), models
2. Textual
      Full-text
      Bibliographic
3. Integrated, e.g. combining textual facts, tables and a bibliographic reference

Whatever the type of database, access to its data is in nearly all cases via an index that
consists of words, phrases, names, codes, class numbers, numerical identifiers, citations
or other elements in the records. Each database consists of records and all records have
fields.

Search is primarily carried out on the data contents of the record within its various
fields. Search terms can be single words, phrases or codes. These can be combined into a
Boolean structure by the use of the operators AND, OR, NOT. A simple use of the structure
of textual discourse is represented by proximity operators: words can be sought that are
adjacent in the text, or near each other, or in the same field, etc. Some use is also made
of the semantics of individual words: words with common elements of meaning can be
conflated by truncation (right-hand, left-hand, or internal). Search may be limited to
records in a certain language or a certain date.
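
To make the mechanics concrete, the sketch below builds a small word index over three
invented records and evaluates Boolean combinations and a right-hand truncation against
it. It is a minimal illustration only: the records, the function names and the use of a
trailing `*` as truncation symbol are assumptions, and real hosts each have their own
command syntax.

```python
import re
from collections import defaultdict

def build_index(records):
    """Map each word to the set of record numbers that contain it."""
    index = defaultdict(set)
    for number, text in records.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(number)
    return index

def lookup(index, term):
    """Look up one term; a trailing '*' requests right-hand truncation."""
    term = term.lower()
    if term.endswith("*"):
        stem = term[:-1]
        hits = set()
        for word, numbers in index.items():
            if word.startswith(stem):
                hits |= numbers
        return hits
    return set(index.get(term, set()))

records = {
    1: "Satellite transmission of digital data",
    2: "Digital switching in local networks",
    3: "Optical fibre transmission",
}
index = build_index(records)

both = lookup(index, "digital") & lookup(index, "transmission")     # AND
either = lookup(index, "satellite") | lookup(index, "fibre")        # OR
without = lookup(index, "transmission") - lookup(index, "digital")  # NOT
truncated = lookup(index, "transmi*")                               # truncation
print(both, either, without, truncated)
```

Set intersection, union and difference correspond directly to AND, OR and NOT, which is
why an index of this shape supports Boolean searching so naturally.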

To undertake a search, the user must take decisions.
Below are listed the most important ones:

1. The user must have access to a terminal (or a
microcomputer acting as such).

2. The microcomputer must be linked via
telecommunication to a variety of mainframe
(or mini) online hosts, on which databases are
mounted. For this purpose the user must have
a signed contract with the national
telecommunication agency and the hosts s/he
will use.

3. The user must be aware of the question s/he wants
to put to the system. It is not such an easy task
to formulate a query in the area in which a
knowledge gap exists in the user's mind. The
answer to the query must fill in this gap.

4. The user must know the databases s/he wants
to interrogate and the hosts on which they are
available.

5. The query is formulated in the vocabulary suitable
for the selected databases (this may involve use
of a thesaurus).

6. The query is expressed as a search statement in the
format required by the selected hosts, using
Boolean and other search operators (proximity,
truncation, field restriction, limits).

7. The selected host is dialled up via
telecommunications and logged on to.

8. The selected database file is entered.

9. The search statement is transmitted to the host, using
the appropriate command language to instruct the
mainframe computer what to do. (The command
language for each host is different.)

10. Search output is presented to the user in the selected
format.

11. If the search output is not acceptable to the user,
the search statement is amended and processed
again as in 9 and 10.

12. Switching takes place between databases, if
required.

13. Search output is delivered to the user (printed or
downloaded).

14. Further hosts may be accessed.

15. Documents may be ordered online.

From the above list of functions performed during an
online search, it becomes evident how many skills
must be acquired before a successful result can be
achieved. An information intermediary can help the
user in many aspects of the problem. But not every
institute can afford to employ enough of these skilled
practitioners to satisfy its workers, and there are
many medium and small firms which have not even
a single intermediary.

The concept of an 'intelligent interface' tries at least
to some extent to help the end-users. An 'intelligent
interface' is a software package that is interposed
between the searcher and the database system, and
that can provide the user with some (ideally all) of
the help that an intermediary gives. What do we
mean by the word 'intelligent'? It is used to mean
any software package that replaces any action
normally undertaken by an information intermediary
in online database search. These actions can range
from the purely clerical, such as automatic dial-up of
telephone numbers, to the fully intellectual, such as
selecting the best way of modifying an unsuccessful
search query or questioning the user to clarify an
input query.

Let  us  see  how  an  intelligent  interface  can  help  the 
end-user  in  the  previously  described  functions. 

A number of intelligent interfaces have been
developed over the last few years, most of them only
to the prototype stage (B.C. Vickery, 1969). I shall be
mentioning one or two of these, but most of my
comments will be illustrated with reference to a
product with which I am associated, the
TOME.SEARCHER. Its procedures are being
improved and incorporated in MITI, a European
Community project in the IMPACT programme.

3.  HELPING  WITH  USER  QUERIES 

The user inputs a query to the system. An 'intelligent'
system should assist the user in 3 aspects:

(i). The best mode of communication between the
end-user and the machine, in either direction,
would be a natural language interface. Can it be
easily achieved?

(ii). Is the search formulated in a satisfactory way
so that the system can supply the user with an
efficient answer?

(iii). User models can enable the system to take into
account the differing needs, skills, knowledge
and expertise of groups of users or even
individuals. How far have we gone in the development
of good user models in human-computer
interfaces?

(i) Natural language

Users find it much easier to express their needs in
their own language than to learn a controlled
language such as many indexers use. When a
natural language interface is envisaged for an
information system, this interface must be able to
translate the natural language statements into the
appropriate terminology used by the system.

Speaking about natural language interfaces, it
should be realised that what is meant is just a subset
of a natural language, a subset which will closely
correspond to the domain covered by the
information system.

The activity of processing natural language is usually
divided into four phases: morphological, syntactic,
semantic and pragmatic.

The morphology of a language has to do with the
make-up of words, and in particular with the suffixes
and prefixes that can be combined with a given 'root'
word: so the root 'love' can appear as 'loves',
'loved', 'lover', 'unloved'. Many of these prefixes and
suffixes are subject to regular rules and these rules
can be utilised in language analysis.

The syntax of a language is concerned with the ways
in which words are combined into larger units -
phrases, clauses, sentences. Words play various
roles in a sentence, the 'parts of speech' - nouns,
verbs, adjectives, prepositions and so on.

Semantics is concerned with the underlying meaning
of text. To understand a sentence fully, it is
necessary to grasp not just the grammatical role of
each word, but also its semantic role.

Lastly  we  come  to  pragmatics.  This  is  concerned 
with  the  context  in  which  a  particular  linguistic 
statement  occurs. 

The  processing  of  user  queries  input  to  an 
information  system  requires  morphological, 
syntactic,  semantic  and  pragmatic  analysis. 

A system accepting free expression of a user query
must process it so as eventually to create a search
statement. The input string is first separated into
words. Most systems for searching electronic
information services remove non-significant words
by scanning against a stoplist. Stopping is usually
followed by stemming, and the ensuing processing is
carried out with the stems.
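
The first steps described above (word separation, stopping, stemming)
can be sketched in a few lines of Python. This is only an illustration,
not the TOME.SEARCHER implementation: the stoplist contents and the
suffix rules are assumptions chosen so that the floppy disk query shown
in the viewgraphs comes out as it does there.

```python
import re

# Hypothetical stoplist; a real interface would hold a much longer one.
# "papers" is treated as non-significant, as in the viewgraph example.
STOPLIST = {"a", "about", "and", "of", "the", "papers"}


def tokenize(query):
    """Separate the input string into lower-case words."""
    return re.findall(r"[a-z0-9]+", query.lower())


def remove_stopwords(words):
    """Drop every word that appears in the stoplist."""
    return [w for w in words if w not in STOPLIST]


def stem(word):
    """Toy stemmer: strips plural endings only (an assumed rule set)."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


def preprocess(query):
    """Tokenize, stop and stem a free-text query."""
    return [stem(w) for w in remove_stopwords(tokenize(query))]


stems = preprocess("Papers about the handling and defects of floppy disks")
```

A production stemmer would need many more rules (and exceptions) than
the two plural rules used here.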

Compound terms are often identified by matching
input against a dictionary, lexicon or index.
TOME.SEARCHER, as well as identifying
compounds that occur in its dictionary, also employs
semantic rules to recognise and create other
compounds appearing in the input.

In each text one can find words which have more
than one meaning, e.g. cell: is it a biological cell, an
electric cell or a prison cell? These multiple meanings
must be disambiguated. Some systems, such as
CIRCE, do it by asking the user; some try to develop
rules (e.g. ERU).

TOME.SEARCHER uses several methods, including
checking a multimeaning term against the context,
that is, the subject area in which the search is to be
carried out. TOME.SEARCHER also adds synonyms
to terms in the dictionary, so that they can be
employed in search strategies.
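
As a rough sketch of checking an ambiguous term against the subject
area of the search, and of attaching synonyms to the chosen sense, one
might write something like the following. The dictionary layout, the
subject labels and the synonyms are all hypothetical; when the context
does not settle the matter the function returns None, which stands for
"ask the user".

```python
# Hypothetical dictionary: each sense of a term is tied to a subject
# area and carries its own synonyms.
SENSES = {
    "cell": [
        {"subject": "biology", "meaning": "biological cell", "synonyms": []},
        {"subject": "electricity", "meaning": "electric cell", "synonyms": ["battery"]},
        {"subject": "prisons", "meaning": "prison cell", "synonyms": []},
    ],
    "defect": [
        {"subject": "general", "meaning": "defect", "synonyms": ["fault"]},
    ],
}


def resolve(term, search_subject):
    """Pick the sense of a term that fits the subject area of the search.

    Returns None when the term is unknown or still ambiguous, in which
    case the interface would have to question the user.
    """
    senses = SENSES.get(term, [])
    if len(senses) == 1:
        return senses[0]
    matches = [s for s in senses if s["subject"] == search_subject]
    return matches[0] if len(matches) == 1 else None


def expand(term, search_subject):
    """Return the term plus the synonyms of its resolved sense."""
    sense = resolve(term, search_subject)
    if sense is None:
        return [term]
    return [term] + sense["synonyms"]
```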

So far we have spoken of language processing in
one language only. But one of the more serious
barriers to communication is the inability to
understand a foreign language. We can break this
barrier by introducing a multilingual interface to
databases. The characteristics of such an interface
could be as follows:

(1) screen displays can be available in all the
languages covered by the system; this would
include the process of database selection.

(2) the interface can accept input of a user query in
each of these languages and refine the query
by interacting with the user in the language of
input.

(3) the terms in the refined query could be
translated into the language of the selected
database immediately before formulation of a
Boolean search statement.

(4) a final bonus would be the translation of
retrieved records into the language of the user,
though here we are going beyond the
immediate functions of a search interface and
into text translation programs.

The first three features will be introduced into the
MITI interface previously mentioned.

(ii)  Search  formulation 

Everybody who has ever worked with users knows how
important it is to clarify the search query posed by
the inquirer. In a discourse between the user and a
knowledgeable, skilled intermediary there is a
process of knowledge enhancement on both sides:
the user specifies more tangibly the missing
information and acquires at least some knowledge of
databases, thesauri and vocabularies, and the
intermediary gains more specific knowledge in the
subject area of the user. For both of them it is an
excursion into each other's mind. This can be
represented in the form of a Venn diagram.

Can such a communication pattern ever be achieved
between man and machine? In TOME.SEARCHER
we tried to create a modest model of a discourse,
using the results of semantic analysis. The problem
statement is made up of a set of frames, one frame
for each term. Frames consist of slots, places where
a particular item of information fits within the larger
context created by the frame. The slots for each
term are defined by semantic categories. By
supplying a place for expected information and thus
creating the possibility of recognising that
information is missing or incompletely specified, the
slots mechanism permits reasoning based on
confirmation of expectations - 'filling the slots'. The
procedure is as follows: first, the system attempts to
fill as many slots as possible with information the
system already possesses. Then the system checks
the frames for 'sufficient completeness', i.e. the
minimum information about a particular concept
required by the search strategy construction
process. The rules specify which combination of
filled slots is necessary and sufficient for the system
to carry out a search. If it is determined that more
information is required, the user is prompted to
supply the missing information.
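
The frame and slot procedure can be illustrated with a small Python
sketch built around the 'pruning' example from the viewgraphs (an
Operation with associated categories Plant and Tool). The dictionary,
the completeness rule (a Plant is required, a Tool is optional) and the
wording of the prompt are assumptions made for the example, not the
actual TOME.SEARCHER rules.

```python
from dataclasses import dataclass, field

# Hypothetical dictionary: semantic category of each known term.
CATEGORY_OF = {"pruning": "Operation", "rose": "Plant", "shears": "Tool"}
# Slots that a term of a given category expects (associated categories).
EXPECTED_SLOTS = {"Operation": ["Plant", "Tool"]}
# Assumed rule for 'sufficient completeness': slots that must be filled.
REQUIRED_SLOTS = {"Operation": ["Plant"]}


@dataclass
class Frame:
    term: str
    category: str
    slots: dict = field(default_factory=dict)


def build_frames(terms):
    """Create one frame per term and fill slots from the other query terms."""
    frames = []
    for term in terms:
        category = CATEGORY_OF.get(term, "Unknown")
        slots = {name: None for name in EXPECTED_SLOTS.get(category, [])}
        frame = Frame(term, category, slots)
        for other in terms:
            other_category = CATEGORY_OF.get(other)
            if other_category in frame.slots and frame.slots[other_category] is None:
                frame.slots[other_category] = other
        frames.append(frame)
    return frames


def missing_slots(frame):
    """Required slots that are still empty after automatic filling."""
    required = REQUIRED_SLOTS.get(frame.category, [])
    return [name for name in required if frame.slots.get(name) is None]


def prompts(frames):
    """Questions to put to the user for every missing required slot."""
    return [
        f"Which {slot.lower()} does '{frame.term}' apply to?"
        for frame in frames
        for slot in missing_slots(frame)
    ]
```

With the query term 'pruning' alone the sketch produces one question
about the plant; if the query also contains 'rose' the Plant slot is
filled from the query and no question is asked.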

This method of eliciting information from the user
worked quite well in a small system. But when
similar procedures were developed for a big
database, the number of questions to the users
multiplied considerably, creating difficulties in
implementing the technique. It is not known to me if
a better system of eliciting information from users
exists in a working intelligent system.

(iii) User models

Future 'intelligent' user interfaces must be dynamic.
This means that they must adjust to user requirements.
Some users have never previously used online
searches; they do not know how to formulate search
strategies or how to use the command language. For
them the search process must be automatic, a 'black
box' with input and output mechanisms. Other users
are experienced searchers. They need to be given
the flexibility of changing a search in whatever way
they want.

The user model should be able to adjust the actions
of the 'intelligent' system to the individual needs of
the user. There are static characteristics of the user,
such as age and sex, and dynamic characteristics
which will change with usage of the system, with a
change of job or even with a change of the
project on which the user is currently engaged. A
reasonable approach therefore would be to ask the
user a few questions about his/her static
characteristics, which could be used permanently by
the system, and questions which would elucidate the
ongoing changes each time the user enters the
system.
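
A minimal sketch of such a user model, assuming a simple split between
answers kept permanently and answers asked afresh at each session,
might look as follows. The question names and the two search modes are
hypothetical and serve only to show the mechanism.

```python
from dataclasses import dataclass, field

# Hypothetical question sets; the names are illustrative only.
STATIC_QUESTIONS = ["location", "language"]
DYNAMIC_QUESTIONS = ["current_project", "search_experience"]


@dataclass
class UserModel:
    static: dict = field(default_factory=dict)   # asked once, kept permanently
    dynamic: dict = field(default_factory=dict)  # refreshed at every session


def questions_for_session(model):
    """Static questions are asked only while unanswered; dynamic ones always."""
    pending_static = [q for q in STATIC_QUESTIONS if q not in model.static]
    return pending_static + DYNAMIC_QUESTIONS


def start_session(model, answers):
    """Store new answers, replacing the dynamic part of the model."""
    for question in STATIC_QUESTIONS:
        if question in answers:
            model.static[question] = answers[question]
    model.dynamic = {q: answers[q] for q in DYNAMIC_QUESTIONS if q in answers}
    return model


def search_mode(model):
    """'automatic' black box for novices, 'expert' control for experienced users."""
    experienced = model.dynamic.get("search_experience") == "experienced"
    return "expert" if experienced else "automatic"
```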

Despite many attempts to implement a user model
in an intelligent information system, there is little
evidence so far of success.

TOME implemented user models in a few of its
systems. The best results have been obtained by
asking the user about his location and the language to
be used during the interactions with the system. The
location gives, for example, a display of libraries and
their sources of information close to the place of
living. The language, quite understandably, enables
the user to understand what is going on during the
search. In an intelligent tutoring system the student
model acts as a historical record of what the student
knows and what he needs to learn. The abilities of
the student can influence the presentation of the
teaching material and the testing of his
achievements.

4.  HOST  AND  DATABASE  SELECTION 

This is another area in which an 'intelligent' interface
can assist the user.

Once an intelligent interface begins to cover
heterogeneous databases with different contents
and structures, it becomes necessary to keep
information within the interface on how to map
queries to these information sources. Similarly there
is a need to keep information on the different hosts
and their facilities and command languages.


Some hosts provide aids to database selection - for
example both DIALOG and ESA/IRS permit the user
to specify a broad subject field, and to make a trial
search of the databases that the host has allocated
to the subject, to discover what output each will yield
on the search topic. Alternatively, databases can be
selected by an interface before going online. The
choice of a subject field can be guided by display of
menus leading down from the broadest subjects to
more specific ones; information about the databases
allocated to the chosen subject can be displayed to
help in final selection. Similar facilities for database
selection are offered by gateways such as
EASYNET.

In TOME.SEARCHER the database selection module
interacts with two other modules, the user profile and
the source data dictionary.

The user profile contains, besides other data, the
host passwords which the user is entitled to use; the
source data dictionary includes data on each host (such
as its command language, SDI facilities and
printing formats) and data on each database (e.g.
type of database, searchable fields, language of
database). The database selection module supplies
descriptions of databases including subject
coverage.

From the search query, the system identifies the
terms and compounds by language processing. The
classification numbers of the terms are then used to
climb the classification of the system to converge on
subject classes that have been used in the database
selector. Hosts and databases for which the user has
no password are eliminated; the remaining
databases are checked in the source data dictionary
and again those that do not match search
specifications (previously collected from the user)
are discarded from the valid list of databases. The
final list of hosts and databases is displayed for user
approval.
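
The selection steps just described can be sketched as follows. All
class numbers, host names, database names and dictionary fields below
are invented for the illustration, and the sketch simplifies the
'convergence' step by collecting every database found on the path from
a term's class up to the top of the classification.

```python
# Hypothetical classification: each class number points to its parent.
PARENT = {"A12": "A1", "A1": "A", "A": None}

# Hypothetical database selector: subject classes that have databases.
DATABASES_BY_SUBJECT = {"A1": ["DB_AGRI"], "A": ["DB_SCI", "DB_FR"]}

# Hypothetical source data dictionary entries.
SOURCE_DATA = {
    "DB_AGRI": {"host": "HOST_X", "language": "en", "type": "bibliographic"},
    "DB_SCI": {"host": "HOST_Y", "language": "en", "type": "bibliographic"},
    "DB_FR": {"host": "HOST_X", "language": "fr", "type": "full text"},
}


def climb(class_number):
    """Yield a class number and all its ancestors, most specific first."""
    while class_number is not None:
        yield class_number
        class_number = PARENT.get(class_number)


def candidate_databases(class_numbers):
    """Databases allocated to any class on the way up the classification."""
    found = []
    for number in class_numbers:
        for node in climb(number):
            for db in DATABASES_BY_SUBJECT.get(node, []):
                if db not in found:
                    found.append(db)
    return found


def select_databases(class_numbers, user_hosts, languages, db_type=None):
    """Drop databases without a password or not matching the specifications."""
    selected = []
    for db in candidate_databases(class_numbers):
        info = SOURCE_DATA[db]
        if info["host"] not in user_hosts:
            continue  # user has no password for this host
        if info["language"] not in languages:
            continue  # user cannot read this database
        if db_type is not None and info["type"] != db_type:
            continue  # wrong type of information
        selected.append((info["host"], db))
    return selected
```

The list returned by select_databases corresponds to the final list
that is displayed to the user for approval.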

In this way the user is helped to choose the right
databases and can be sure that none are missed,
but in the end the final choice remains his/her
own decision.

5. THE SEARCH PROCESS

The process of creating a search strategy can be
divided into three stages: the initial setting up of the
strategy, the modification of the search strategy and its
translation to the host command language.

In  a  conventional  search  (without  an  intelligent 
interface),  the  user  needs  to  know:  the  terms,  their 
relationships  with  other  terms,  Boolean  operators 
and  the  command  language  which  can  and  usually 
does  differ  for  each  host. 

With an intelligent interface, the user should not need
this knowledge. The terms and their relationships are
determined in the dictionary of the system; the
Boolean operators are implemented mainly by the
intelligence of the system, supplemented by
interaction of the system with the user. The created
search strategy is then translated into the command
language of the host (stored in the source data
dictionary mentioned before). The search strategy is
then released to the host database(s) and the yield of
the search is presented to the user. Very often some
modifications to the search are needed and these
will again be introduced either by the system or by
the user. Some of the possible amendment
procedures are already available on some hosts - for
example, ESA/IRS has a ZOOM procedure, and
INFOLINE has a GET procedure.
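
To make the translation step concrete, here is a small sketch in which
a host-independent strategy (concepts, each with alternative terms) is
rendered in the command language of a chosen host. The two host
profiles are hypothetical and do not reproduce the syntax of any real
host; in the sketch all concepts are simply combined with AND, which is
a simplification of what an interface would decide.

```python
# Hypothetical host profiles, standing in for source data dictionary entries.
HOST_PROFILES = {
    "HOST_A": {"command": "SELECT", "adjacent": " (w) ", "truncate": "?"},
    "HOST_B": {"command": "FIND", "adjacent": " ADJ ", "truncate": "*"},
}


def render_term(words, profile):
    """Join the words of a (compound) term with the host's adjacency
    operator and append the host's truncation symbol."""
    return profile["adjacent"].join(words) + profile["truncate"]


def render_strategy(concepts, host):
    """Render a strategy for one host.

    concepts: list of concepts; each concept is a list of alternative
    terms (synonyms); each term is a list of one or more word stems.
    Alternatives are ORed, concepts are ANDed.
    """
    profile = HOST_PROFILES[host]
    groups = []
    for synonyms in concepts:
        rendered = [render_term(words, profile) for words in synonyms]
        groups.append("(" + " OR ".join(rendered) + ")")
    return profile["command"] + " " + " AND ".join(groups)


concepts = [
    [["defect"], ["fault"]],
    [["floppy", "disk"], ["diskette"]],
]
statement_a = render_strategy(concepts, "HOST_A")
statement_b = render_strategy(concepts, "HOST_B")
```

Keeping the strategy in a neutral form until the last moment means that
the same query can be sent to several hosts by changing only the
profile.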

The interface may also initiate a relevance feedback
procedure: presenting sample output from the first
search to the enquirer, asking him/her to evaluate
each item, and using the response to amend the
search statement. More weight may be given to the
index terms of relevant items, and less weight to
those of items judged not to be relevant.
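
One simple way of turning such judgements into term weights is
sketched below. The additive weighting, the step size and the threshold
used to drop or add terms are assumptions made for the illustration;
they are not taken from any particular interface.

```python
from collections import Counter


def feedback_weights(judged_items, step=1.0):
    """Weight index terms from the user's judgements.

    judged_items: list of (index_terms, is_relevant) pairs.
    Terms of relevant items gain weight, terms of others lose it.
    """
    weights = Counter()
    for terms, relevant in judged_items:
        for term in set(terms):
            weights[term] += step if relevant else -step
    return dict(weights)


def revise_terms(query_terms, weights, threshold=1.0):
    """Drop query terms weighted at or below -threshold and add new
    terms weighted at or above +threshold."""
    kept = [t for t in query_terms if weights.get(t, 0.0) > -threshold]
    added = sorted(
        t for t, w in weights.items() if w >= threshold and t not in kept
    )
    return kept + added


judged = [
    (["floppy disk", "handling"], True),
    (["floppy disk", "storage"], True),
    (["hard disk", "defect"], False),
]
weights = feedback_weights(judged)
revised = revise_terms(["floppy disk", "defect"], weights)
```

A real procedure would probably be more cautious about removing a term
that the user typed in, for instance by asking for confirmation first.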

The automatic procedures in search strategy are
carried out by TOME.SEARCHER for inexperienced
users. Other interfaces also introduce some help in
building search strategies, in most cases with user
intervention.

6. COMMUNICATION PACKAGES

There are many communication packages in use in
information retrieval systems. The intelligent interface
incorporates the communication package into the
system. No initiative of the user is necessary: the
search strategies in the appropriate command
languages pass smoothly from the terminal to the
host computer and back to the user terminal.

7.  CONCLUSION 

Intelligent interfaces can offer many facilities to make
it easier to interact with online databases. I have
particularly mentioned:

- helping the user to select appropriate databases

- accepting queries in the user's own language,
  even when it is not the language of the database

- intelligently processing these queries to formulate
  an optimum search

- automatically putting the search into the form
  required by the online system

- automatically handling all telecommunications with
  the host system

An interface with these facilities can go a long way to
bridge the communication gap between end-users
and online sources of information and overcome
some of those barriers which prevent the end-user
from using the online databases frequently and
efficiently.
NOTE

In  response  to  requests,  the  viewgraphs  used  during 
the  presentation  are  included  here  (pages  8-6  to 
8-9). 
ONLINE DATABASE GROWTH

[Graph; horizontal axis '80 to '90. Labelled values:]

   Databases          4465
   Producers          1950
   Services (Hosts)    645


SEARCH ACTIONS AND DECISIONS

[Diagram]


TYPES OF DATABASE

FACTUAL
   Statistical
   Textual Facts
   Formalisms

TEXTUAL
   Full Text
   Bibliographic

INTEGRATED
   e.g. Textual Facts
        + Tables
        + Bibliographic References


KNOWLEDGE NEEDED

1  SUBJECT DOMAIN:
      Terminology
      Thesaurus Relations
      Classification

2  HOST CHARACTERISTICS:
      Command Language
      Search Facilities

3  DATABASE CHARACTERISTICS:
      Subject Scope
      Type of Information
      Field Structure
      Language
      Output Formats

4  TELECOMMUNICATIONS PROCEDURES

5  SEARCH STRATEGY:
      Query Formulation
      Search Modification


SEARCH PATTERNS

[Diagram]


LANGUAGE PROCESSING

1  Removing Non-Significant Words
   from Query by use of a Stoplist

2  Stemming to Remove Plurals
   and Other Suffixes

3  Look up in Dictionary

4  Questioning User to Clarify
   Words not in Dictionary

5  Creating Compound Terms

6  Resolving the Meaning of
   Ambiguous Words

7  Linking Query Terms
   to Synonyms


HELP WITH USER QUERIES

1. User can express query
   in natural language

2. System helps user to
   create a well-formulated
   query

3. User model guides the
   interaction between
   user and system


FROM QUERY TO SEARCH

   "Papers about the handling and
   defects of floppy disks"

   STOP:      HANDLING - DEFECTS - FLOPPY DISKS

   STEM:      HANDLING - DEFECT - FLOPPY DISK

   COMPOUND:  HANDLING - DEFECT - FLOPPY (w) DISK

   SYNONYMS:  (HANDLING OR CARE) -
              (DEFECT OR FAULT) -
              (FLOPPY (w) DISK OR FLOPPY (w) DISC OR
              DISKETTE)

   BOOLEAN:   (HANDLING OR CARE) OR
              (DEFECT? OR FAULT?)
              AND (FLOPPY (w) DISK?
              OR FLOPPY (w) DISC? OR
              DISKETTE?)
MULTILINGUAL FACILITIES

1  Query Input in User Language

2  Query Formulation and
   Modification in User Language

3  Screen Messages in
   User Language

4  Automatic Translation of
   Search Terms into
   Language of Database


MODELLING THE USER

1  CAPABILITIES
   - Experience in Searching
   - Knowledge of Subject
   - Knowledge of Languages

2  WORK CONTENT
   - Practical (Engineer)
   - Research
   - Administration
   - Student
   - Amateur Interest
     etc

3  PREFERENCES IN CURRENT SESSION
   - Type of Information
   - Volume of Output
   - Precise or Broad


CLARIFYING A QUERY

The Use of Frames and Semantic Categories

   QUERY TERM:  Pruning

   CATEGORY:    Operation

   ASSOCIATED CATEGORIES:
      Plant: ?
      Tool:  ?

   Fill Empty Slots by Questioning User


HELP IN SELECTING DATABASES

   Query Input

   Look up Terms in Dictionary
   Find their Class Numbers

   Climb Classification to Reach
   Subjects used in Database Index

   Select Database/s

   Refine Selection Depending on:

      1  Host
         (Does user have password?)

      2  Type of Information
         (Is it included in database?)

      3  Language of Database
         (Can user understand this language?)

   Offer Final Selection to User
   for Confirmation
BRIDGING THE GAP

1  Helping User to Select Sources
   of Information

2  Accepting Queries in the
   User's Language

3  Intelligently Processing Queries
   to Form an Optimal Search
   Statement

4  Automatically Putting the Search
   into the Form Required by the
   Online Host

5  Handling Automatically all
   Telecommunications with
   the Host


HELP DURING SEARCH

1  Initial Search Strategy
   - Dictionary supplies relations
     between search terms
   - System inserts Boolean and
     adjacency operators and
     uses truncation

2  Translation for Host
   - System uses host command
     language and transmits
     search

3  Search Modification
   - Thesaurus offers user
     broader, narrower and
     related terms


MULTILINGUAL FACILITIES

1  Screen Messages in
   User Language

2  Query Input in
   User Language

3  Query Formulation and
   Modification in User
   Language

4  Database Selection Takes
   into Account which
   Languages User Can Read

5  Automatic Translation of
   Search Terms into
   Language of Database
   Index


SOURCE DATA DICTIONARY

This will Store Data about Hosts and Databases

   Command Languages

   Type of Database

   Searchable Fields

   Language of Database Records

   Document and Information Types Included

   Character Sets


Database  Publishers'  Challenges  for  the  Future 

By 

Barbara  Lawrence 
Administrator 
Technical Information Service
555  West  57th  Street 
Suite  1200 
New  York,  NY  10019 


INTRODUCTION 

As  decades  and  centuries  reach  their  turning  points  it  is  useful  to  pause  a  moment  and  think 
about  the  future.  Now  is  one  of  those  times,  and  it  can  be  a  bit  frightening  for  database  publishers. 
There  are  so  many  competing  forces,  technologies,  and  user  communities  that  it  is  hard  to  forge 
our  own  vision. 


This  analysis  of  the  path  to  the  future  tries  to  keep  the  focus  on  the  role  of  database  publishers. 
As  an  aid  to  achieving  this  focus,  this  paper  takes  as  its  theme  some  recent  words  from  Everett 
Brenner,  long  a  gadfly  of  the  database  and  secondary  information  service  community. 

"The success of information systems using computers, whether in research or business
areas of the corporation, rarely ever depends on the technology. Success finally hinges
on the information itself - its quality and ability for humans to communicate real needs
which can be translated into a digestible and retrievable form."1

With the focus clearly on our role as information providers who are concerned with providing
information  content  through  various  access  channels,  we  can  evaluate  the  trends  and  the 
challenges.  The  following  discussion  will  consider  briefly  users,  who  are  the  reason  for  building 
databases,  the  very  enticing  visions  of  how  databases  can  be  used,  and  finally  the  more  specific 
challenges  that  database  producers  must  meet  in  order  to  remain  effective  in  these  future 
scenarios. 

AEROSPACE  SCIENTISTS  AND  ENGINEERS 

There  are  many  different  information-using  communities,  and  the  differences  between  them  may 
be  large,  even  within  categories  such  as  R&D  or  engineering.  Thus,  in  order  to  meet  the 
challenges  ahead  as  database  publishers,  it  is  necessary  first  to  understand  the  particular  nature 
of  the  specific  users.  It  is  important  to  reduce  communications  barriers  so  that  we,  the  information 
providers,  may  do  our  part  in  aiding  aerospace  scientists  and  engineers  to  do  their  best  work. 

The  NASA/DOD  Knowledge  Diffusion  Project 2  is  providing  an  up-to-date  baseline  of  information 
habits  of  aerospace  scientists  and  engineers.  This  kind  of  understanding  is  critical.  Without  it, 
information  professionals  often  define  user  needs  in  terms  of  information  systems  tools,  rather 
than  as  the  need  to  solve  a  technical  problem.  For  instance,  one  study  3  defines  needs  in 
engineering  as  including  easier  access  to  databases  and  computer  program  libraries.  However, 
these  are  things  that  the  information  provider  must  do  to  meet  the  real  need,  which  is  for 
information,  not  systems. 

The following is a generalization from the observations that have been made at AIAA's Technical
Information Service about the aerospace scientist and engineer as represented by the AIAA
member.

Engineers are trained to be "do-it-yourselfers." Coupled with the large scale of many aerospace
and defense projects, one result is the "not-invented-here (NIH) syndrome," the tendency to
believe that all of the required expertise is located within the home organization. It is especially
prevalent in the aerospace community. Recent National Science Board studies find that US
scientists and engineers continue to overcite the US literature and to cite primarily their own
specific field 4. They seem somewhat unadventurous in exploring the literature, even as they
engage in very adventurous technology programs.

This insular thinking is so pervasive that the NIH syndrome even appears in the field of space law.
One would expect that dealing with space policy would be inherently international.

Yet the picture should not be painted as overly bleak. Although the focus for aerospace scientists
and engineers is on known sources, and the starting point for information location is collegial
sources, aerospace scientists and engineers do use the literature, particularly journal articles,
conference papers, and reports. Pinelli found that ninety percent indicate that technical
communication is very important. 5

The scope of aerospace and defense is multidisciplinary and older literature is as important as the
latest leading edge material. It may well be that aeronautics libraries, if forced to choose one
collection only, would keep the old National Advisory Committee for Aeronautics (NACA) reports
first. This is a body of high quality, well documented knowledge from 1915-1958. Thus, the
original quality of the input to the database really matters. Perhaps this focus on reliability should
be used in analyzing other databases. Quality is a hallmark of building the database at AIAA, but
maybe it is time to test perceptions.

Aerospace  scientists  and  engineers  are  not  merely  looking  for  information;  they  are  trying  to  solve 
problems  or  find  specific  answers.  It  is  therefore  necessary  to  organize  high  quality  information  in 
an  accurate  and  effective  fashion  and  then  to  design  the  tools,  the  technology,  for  ready  access. 

That aerospace scientists and engineers are not intimidated by information technology is
supported by Pinelli's findings 6 that these scientists and engineers understand what the
technologies are and that a majority even identify the following technologies as "I don't use it, but
may in the future":


°  laser  disk/video  disk/compact  disk 

°  video  conferencing 

°  electronic  bulletin  boards 

°  electronic  networks 

These  findings  are  supported  by  unpublished  AIAA  member  surveys,  which  show  extensive  use 
of  PCs  for  external  network  communications  and  information  retrieval,  and  a  strong  interest  in 
information-technology-based  technical  information  services.  These  scientists  and  engineers  are 
also  at  ease  with  related  terminology,  such  as  electronic  networks,  computer  conferencing,  and 
optical  disk. 

One reason to give scientists and engineers credit for being comfortable with information
technology is that they are the ones who built it. It is the job of the information professional to
organize the information content, but this should be done in collaboration with the users.
Scientists and engineers are not unaware of the problem of organizing and making accessible
large volumes of information. In 1984, a NASA Ames scientist 7 described a vision of networks
and workstations for instant comprehensive access in a paper about "Information Systems: Issues
in Global Habitability". On this all important issue, the authors noted that "little has been said
regarding the problems of how to collect, access, manipulate and store data on such an
overwhelming scale." This large scale of scientific and technical information is a particular issue in
aerospace, but is not unique to these databases.

There is now the opportunity to work closely with the constituency. If database producers treat
their customers as partners, they will work with them. In this way, by reinforcing our appreciation
for the information requirement of users, database producers will deliver the most appropriate
information and avoid being carried away by what may excite us the most.

THE  FUTURE  -  THE  VISIONS 

Database publishers have a variety of visions or scenarios to use as a planning basis. These
visions generally have two common elements: desktop access (and note, no intermediary) and
virtual integration (any type of information available through a single interface).

The parent of these futures for scientific and technical information was Vannevar Bush, a US
science advisor, who coined the phrase 'MEMEX machine' in 1945. 8 In those times he saw the
data store as microfilm but the data structure as hypertext, all integrated with the desktop (or
actually, the desk).

Another early vision was Organization X. 9 This was to be a coordinated effort on the part of the
scientific and technical information abstracting and indexing services to reduce overlap and
expand coverage through a joint venture operation, which would then facilitate project-focused
repackaging.

However, the NFSAIS members (National Federation of Science Abstracting and Indexing
Services, now known as the National Federation of Abstracting and Information Services/NFAIS)
who commissioned the study which recommended Organization X were overwhelmed with
concerns about competition, and no grand schemes were undertaken. Fortunately, NFAIS is
currently working on the themes of cooperation and standards, working to provide leadership to
database publishers.

In the early eighties, James J. Harford, the AIAA Executive Director, articulated a vision of a global
science and technology network (STN) 10 to serve five million scientists and engineers from their
desktops with all types of information: journal literature, conference proceedings, design
specifications, meeting programs, electronic mail, handbook data; all internationally available with
language translation, news, discussion groups, and so on.

Harford  wrote  that  there  are  "no  technological  show-stoppers  in  the  way  of  realization  of  this 
scenario.  There  are  formidable  costs  and  difficult  organizational  and  political  barriers."  He  then 
described  his  experiences  trying  to  get  the  engineering  societies  to  cooperate  to  achieve  this 
vision.  Harford  was  no  more  successful  than  NFSAIS  was  in  joining  abstracting  and  indexing 
services.  But  the  American  Chemical  Society  took  on  the  STN  notion  in  their  own  way. 

A  key  lesson  learned  from  reviewing  these  visions  is  the  need  for  a  truly  cooperative  team,  if  the 
information  and  the  technology  are  ever  to  come  together.  In  August  1990  the  Southern 
California  Online  Users  Group  (SCOUG)  met  to  analyze  database  quality  criteria.  They  developed  a 
quality  rating  scheme  that  considers  that  the  database  and  the  system  must  be  evaluated 
together;  after  all  they  are  only  useful  together. 

More recent versions of the "vision" begin to address the effort that it will take to get there. Martha
Williams understood that it will take extensive research and development to build these
transparent information systems. 11 Are database publishers contributing sufficiently to this
effort? The commercial publishers perform market research and product development. In the US,
the National Science Foundation no longer funds academic information science research,
supporting computer science instead. How will this research be done? The NATO Advisory Group
for Aerospace Research and Development, Technical Information Panel is currently developing a
research agenda. Other information professional organizations, such as the Special Libraries
Association and Medical Libraries Association, also recognizing the need for research, have
developed research agendas. It is hoped that these agendas will encourage both researchers
and funders.

Mick O'Leary expresses the vision and the challenge another way, predicting that the 1990s will
be the Age of Access. He means by this that "The 90s will be less concerned with seeking new
information than with organizing, distributing, and applying that which we already have." 12

Monitor, arguably the most insightful, but surely the most enjoyable, reading for the
commercial/external database community, tried to foretell the 90s. They believe that the message
of the 80s was that not everyone needs access to huge databases and that more is not always
better. They predict that the '90s, fostered by the power of the desktop workstation, will see
integration of internal and external data. Database publishers, serving the professional
community, will see "a shift from print information to electronic in either ASCII or digitized form and
... thus more easily exploitable at the local workstation." 13

Vast, resource-rich networks are moving from vision to reality. One of the major challenges for
database producers is to relate to these networks. Network developers have their work to do. For
instance, NASA has five networks, not all of which interconnect. To solve this dilemma, companies
have formed (e.g., DASNET and GEONET) to switch messages among electronic mail systems.
And in the US, it is estimated that the largest network, the INTERNET, consists of 330,000
computers and 2 million users. The proposed National Research and Education Network (NREN)
will expand this INTERNET and is "the beginning of a focus on networks as part of an
infrastructure for access to information resources, rather than simply as highways providing
connectivity and carrying bits from one place to another." 14

A leading force behind the NREN, Robert Kahn, takes the "wired nation" vision the next step,
saying that "you'd like all participants in countries to share information in a uniform way." 15 Kahn,
the designer of ARPANET (the first network), recognizes the need to deal with issues beyond
moving megabytes of data around and that data organization, standards, intellectual property
rights, and data access rights are all issues that we must address. The dialog on these topics should
include a significant presence from the database producing community, although it does not yet
do so. NFAIS has opened discussions with Kahn as a first step in this direction.

The  urgency  of  the  issues  mentioned  above  is  emphasized  when  we  acknowledge  the  interface 
part  of  Kahn's  vision.  He  is  building  bits  of  artificial  intelligence  software  called  Knowbots,  which 
will  take  the  query  and  find  information  anywhere  in  the  network  in  a  process  that  is  completely 
transparent  to  the  user.  So  records  and  even  parts  of  records  will  be  disconnected  from  the 
databases  and  host  systems.  Rather  than  ignoring  the  challenges  posed  by  this  image,  database 
publishers  must  step  up  to  them.  Users  will  appreciate  it. 

Networks  already  have  demonstrated  their  impact.  Recently  the  international  mathematics 
community  used  electronic  mail  networks  to  solve  a  problem  in  a  matter  of  weeks,  as  competing 
teams,  aided  by  the  ease  of  communication  and  the  energy  of  competition,  rushed  to  solve  a 
complex problem. One mathematician said that "computer networks are replacing professional
journals and conferences." 16

One  disquieting  facet  of  these  visions  of  instant  access  to  a  universe  of  information  is  that  the 
visionaries  tend  to  ignore  the  role  of  the  professional  database  publisher.  Yet  here  are  the 
information  scientists  who  can  address  the  problems  of  information  structure  for  interchange  and 
access. Here are the people who manage the quality of the data. If information access is to be
meaningful, the information "must be complete, current, and presented in ways that encourage its
use. It must also be reliable, accurate, and absolutely dependable." 17 Lois Granick stated the
charge clearly: database publishers "must take an active leadership role or lose control and, in
so doing, the world loses an opportunity to become an informed society."


FROM  VISION  TO  REALITY — SEGMENTATION 


Although all of these visions express the notion of integration of multiple resources, the reality,
pushed by the same technological opportunities, is really diversity, at least from the database
publisher perspective. Morry Goldstein, President of IAC, says that "the wide variety of ways user
organizations wish to use the information we offer for sale means we are being forced into an RFP
(request for proposal) kind of business wherein we respond to very specific user requirements
with a proposal tailored to that environment." 18

The  user  community  is  not  monolithic.  It  is  not  even  as  simple  as  intermediaries  or  information 
professionals and end-users. One can subdivide these groups into, for instance, reference librarians
and literature searchers, special librarians and academic research librarians, engineers and
scientists,  researchers  and  managers.  The  categories  can  and  should  be  expanded,  if  not  for  the 
purpose  of  developing  different  databases  for  each,  at  least  for  analyzing  how  the  database  may 
be  used  by  each  segment.  Database  publishers  have  so  far  tailored  products  for  professional, 
library,  management,  and  consumer  communities.  In  the  future,  more  differentiation  may  occur. 

Within the aerospace STI-using community, the user environment is also segmenting, with
different technology platforms being favored for information access by different groups. The
academic community has adopted CD-ROM and, in large universities, is moving to "local online"
for widely used databases. (Local online is online from a university mainframe for institution-wide
access, usually combined with the online library catalog.) Government users are beginning to rely
on networks. Corporate users may go "local online" but are also becoming bulk buyers from the
vendors' services, because they generally use so many databases.

While  most  online  use  in  aerospace  is  mediated  by  the  information  professionals,  the  scientists 
and engineers, at least those who are AIAA members, tell us that they want to search directly from
their  own  desktop.  AIAA  is  attempting  to  meet  this  challenge  of  diversifying  user  segments  with  a 
new  service,  a  menu-driven  front-end  on  DIALOG,  called  the  AIAA  Connection.  The  AIAA 
Connection  will  do  more  than  just  search  the  Aerospace  Database.  It  will  also  provide  electronic 
mail,  bulletin  boards  (for  member  committees,  for  AIAA  headquarters  departments,  for  meeting 
schedules,  etc.),  document  ordering,  and,  ultimately,  other  databases. 

Meanwhile,  many  aerospace  scientists  and  engineers  are  using  optical  storage  for  raw  data.  Also, 
aerospace corporations are evaluating optical storage for archives. If these platforms become
widespread within the end-user base, should database publishers redesign their products for
end-users  in  localized  settings? 

World  political  changes  will  also  have  an  impact  on  aerospace  scientific  and  technical  information 
service.  For  now  it  appears  that  the  changes  will  be  negative,  resulting  in  closed  libraries,  cost- 
recovery  requirements  for  the  Department  of  Defense  Defense  Technical  Information  Center,  and 
information  budget  cuts  generally.  Yet  at  the  same  time,  these  changes  seem  to  be  encouraging 
research  and  development,  the  major  process  for  scientific  and  technical  information  use  and 
generation.  In  fact,  AIAA  membership  is  growing,  not  declining,  as  some  might  expect  in  these 
confused  times.  So  the  information  providers  will  be  even  more  pressed  in  meeting  the  needs  of 
the  many  segmented  users  and  environments  requiring  our  services.  It  will  be  necessary  to  be 
creative,  cooperative,  and  flexible. 


CHALLENGES  FOR  DATABASE  PUBLISHERS 


Flexibility is the critical keyword for the future of database publishers. Some other good keywords
are progressive, technology-driven, strategic planning, and user-focused. But most important is
flexibility.

In 1985, Martha Williams, a longtime observer of the database world, identified the issues relating
to electronic databases as "public-private sector competition, transborder dataflow, copyright,
downloading, and the changing roles in database generation and processing." 19 These issues
are still around, but today's challenges have a lot more to do with leadership and determination.


Getting Digital

If the future is electronic distribution, then the first challenge, and the key to flexibility, is to digitize
the data - all of it. This is not so simple. First, publishers must move beyond bibliographic
information and on to full-text, numeric, and image information. Database publishers will have to
manage and integrate a variety of input techniques, such as manual data entry, data conversion
contractors, scanning, and author electronic submission. 20 We will need to avoid seduction by
enthusiastic sales people, because selecting the methods that fit any particular material is not
necessarily easy. There are all sorts of details to be resolved. Some examples: scanning from thin
paper is a problem, because back printing bleeds through; authors do not like to code
manuscripts; data conversion contractors are fast and low-priced, but cannot correct errors. And who
has the right to correct another publisher's text?

However,  managing  information  means  that  database  producers  must  do  more  than  create  an 
electronic  store.  The  EUSIDIC  community,  in  the  spring  of  1990,  focused  the  challenge  on  data 
structure. 21  To  do  this,  database  publishers  need  to  be  willing  to  cross  barriers,  even  to  work  with 
other  publishers.  How  else  can  we  use  hypertext  to  go  from  data  tags  to  the  numeric  tabular  data, 
from  the  role  indicator  to  the  chemical  reaction,  from  the  description  to  the  computer  code,  or  from 
the  abstract  to  the  full-text?  How  else  can  we  move  beyond  Boolean  logic  and  controlled 
vocabulary  to  the  world  of  Robert  Kahn's  Knowbots?  How  else  can  we  move  from  the  retrieval  of 
information  to  enable  the  use  and  manipulation  of  that  information?  And  how  else  will  we  separate 
quality  from  quantity? 

Database  producers  will  have  to  change  their  practices  and  strategies  to  meet  the  challenges  of 
creating  new  data  structures  for  electronic  dissemination.  We  must  support  real  information 
science research. And, even more difficult, we must develop a cooperative effort and a
willingness  to  commit  to  using  standard  practices. 

Database  publishers  once  rejected  the  notion  of  Organization  X.  Today  they  hardly  use  the 
standards that do exist (ISO, NISO). Will they take the primary publishers as an example and adopt
a Standard Generalized Markup Language (SGML) application for databases? (This would position
publishers for acceptance in many user environments, since database loading would be consistent.)
Will  they  control  the  future,  or  let  others  create  the  standards,  as  database  vendors  have  done  in 
the  last  decade?  If  the  database  professionals  sit  back,  the  network  developers  will  lead,  without 
benefit  of  our  understanding  of  information  science,  database  structure,  information  retrieval,  and 
so  forth. 


Quality 

Creating  a  complete  electronic  data  store  is  not  the  whole  answer  unless  the  challenge  of  quality 
is met. Quality is this year's hot topic, at least in the US, and with good reason. We face a less
passive  audience,  one  that  can  select  from  a  variety  of  resources.  Expert  users  have  become 
increasingly  vocal.  As  mentioned  earlier,  SCOUG  is  developing  a  rating  scheme  for  database 
quality,  one  that  goes  to  considerable,  specific  depth. 

End  users  are  clear  that  they  want  quality  not  quantity.  AIAA  members  often  say  that  their  library 
gives  them  too  much  information,  rather  than  the  answer  to  their  problem. 

Total  Quality  Management,  and  its  related  concept  of  continuous  improvement,  has  become 
entrenched in government contracting within the US Department of Defense and the National
Aeronautics  and  Space  Administration.  It  should  apply  to  information  as  well  as  manufacturing. 

Aitchison raised the issue of quality in his 1988 Miles Conrad Memorial Lecture, "Aspects of
Quality." 22 Several recent articles have identified certain quality characteristics in common, as
shown in the following table.


QUALITY CHARACTERISTICS FOR DATABASES


Sources compared: Aitchison 23, Potzscher 24, Maxon-Dadd 25, Mintz 26, SCOUG 27

Characteristics compared: Accuracy; Reliability; Consistency; Comprehensive; Timely; Cost/Value;
Support; Accessible


Additional  characteristics  include  integration  (with  the  retrieval  system,  links  to  full  text),  controlled 
vocabulary,  and  error  correction  policy.  The  list  is  long.  It  is  important  that  database  creators  work 
with users to establish a thoughtful evaluation process and respond to constructive critiques of
their products.


Business  Issues 

After  the  objective  of  creating  digital  databases  is  achieved,  we  next  face  a  very  complex  set  of 
business  challenges. 

Pricing  is  perhaps  the  primary  conundrum,  because  if  we  are  wrong,  we  may  not  receive  sufficient 
revenue to continue to produce the database. While for most there will still be print products,
electronic dissemination in multiple environments will begin to dominate. There is a cost in
delivering for multimedia, segmented user communities, and in achieving a new quality standard.
The intellectual efforts, such as for abstract writing, still have a price tag. It is not a trivial challenge
to  learn  how  to  price  according  to  the  value  of  the  information,  but  we  will  have  to  try. 

Packaging  is  an  issue  quite  related  to  pricing.  Delivering  to  some  of  these  diverse  distribution 
channels  may  require  incorporation  of  hardware  or  software  with  the  database.  Many  database 
publishers  have  never  had  sufficient  information  technology-skilled  staff  to  manage  the  design, 
much  less  the  customer  service  aspects,  of  such  packages. 

The  third  business  challenge  is  to  create  fair  business  agreements  between  data  creators, 
distributors,  and  users.  The  balance  between  having  a  crystal  clear  and  comprehensive  license 
agreement  and  retaining  some  sense  of  trust  and  respect  among  the  partners  is  very  difficult.  In 
multinational  arenas  the  legal  issues  are  surely  more  complex.  We  cannot  be  naive,  but  we  should 
not be controlled by our attorneys. Ron Dunn, now of Maxwell Multimedia Publishing Group,
always said, and I paraphrase, that lawyers are there to get you out of trouble after the fact,
not to keep you out of it; otherwise you would never do anything in the first place.

Policy  Challenges 

The  business  issues  are  real  ones,  critical  to  our  survival.  The  policy  challenges  are  equally 
serious.  One  long-standing  topic,  a  hot  one  in  government-dominated  aerospace  and  defense,  is 
the  role  of  the  public  sector  as  information  provider.  The  US  Congress  sponsored  a  recent  study, 
Helping America Compete, 28 which urges the federal government to recognize the value of
scientific and technical information and to reinforce its agency and unified efforts. In Europe,
DG-XIII of the EEC has specifically and strategically supported information service and information
technology developments. In Japan, the approach is to organize government-run information
collection and dissemination centers. Overall, the private sector has benefited from government
support  and  has  returned  the  service  by  adding  value  and  dissemination  expertise  to  basic 
government  data. 

However, at least in the US, there are tensions, as government begins to add value to its
databases and, in particular, becomes CD-ROM crazed. Another source of tension can be
regulatory, a reality as AT&T and the Baby Bells push for the ability to provide information services.
The push-pull of this relationship can only be managed with honest and open dialog. If we play
power games, we will be distracted from the information providing objective.

Addressing public-private sector relationships is part of the process of readdressing all of our
relationships. Networking, for example, may alter the roles of all of the players: database
publishers, authors, primary publishers, vendor-host, gateway, library, etc. One way to approach
this is a pending effort by NFAIS to revisit their Gateway Code of Practice, a matrix of rights and
responsibilities, in light of the diversification of online environments to include networks, local
online, and local area networks.

Not  to  be  lost  in  revisiting  the  relationships  in  the  information  creation  and  delivery  chain  are 
intellectual  property  rights.  This  topic  is  at  least  a  full  paper  in  itself.  Intellectual  property  is  an  active 
arena,  and  some  current  topics  are: 


° Database deposit, US
° Copyright of databases, EC
° Author awareness of rights
° Role of reproduction rights organizations
° Chain of rights (author-primary publisher-full text database-online service)
° Electrocopying

The  last  group  of  challenges  -  ethics  and  cooperation  -  have  been  written  of  before.  There  is  an 
urgency  incumbent  upon  us  now  to  meet  these  challenges  at  a  high  level  if  we  are  to  achieve  our 
mission in the complex future environment. It is only by taking the high moral road that we will be
able  to  handle  the  other  challenges. 

Weil 29 and Granick 30 have been our most consistent spokespersons on the subject. Let us
take the time to reflect, to remember why we are here, and to review our mission - our focus. Most
in the scientific and technical information profession are here because of a strong belief in its
value in the world. If this is still true, we must recognize our purpose and use it to empower us to
address  our  roles  and  responsibilities  with  seriousness. 

CONCLUSION 

Aitchison said it so well:

"Instead of worrying too much about who's doing what to whom, we should get on with our work
and  keep  developing  our  plans  without  looking  over  our  shoulders  all  the  time.  In  particular,  we 
should  resolve  that  if  anyone  ever  gives  our  users  what  they  really  want,  we  shall  be  part  of  the 
team  that  does  it." 31  If  we  do  this  in  a  cooperative  and  proactive  manner,  we  can  achieve  our 
goals. 

This  paper  discussed  the  challenges  for  data  content  and  electronic  distribution,  quality  and 
standards,  flexibility  and  leadership,  ethics  and  cooperation.  If  we  see  our  job  as  delivering  the 
right information at the right time and place, then we must get on with meeting these challenges.


EXAMPLES OF DATABASES USING DIGITIZED TEXTS
(FULL TEXTS OR ABSTRACTS)


by


Pascal Pellegrini and Philippe Laval

CORA SA

93, avenue de Fontainebleau
94270 Le Kremlin-Bicêtre
France


The aim of this paper is to show the advantages of a document database indexed directly on the
text, as opposed to conventional thesaurus-based systems.

The paper is divided into three parts: after a review of the state of the art in computational
linguistics, we explain how DARWIN™ performs this indexing. Finally, we conclude with several
example applications.


I. Computational linguistics.


The aim of computational linguistics is to give the computer natural-language processing
capabilities that include a minimum of understanding. In particular, this definition excludes all
applications such as word processing, statistical processing, or simple keyword searching based
on a manually built thesaurus.

In this chapter we consider only written language, assuming that all data-entry problems have
been resolved.

I.1 The field

I.1.a Tools

• Parsers (or syntactic analyzers): programs that compute the syntactic representation of a
sentence.

• Grammars: sets of rules describing the permitted syntactic constructions.

• Computational lexicons: for example, the DELA system of LADL (Université de Paris 7), which
covers more than 700,000 inflected forms of French words and 400,000 French compound
words.

• Knowledge representation tools (for the semantic processing of texts).

• Generators [remainder illegible].

I.1.b Applications

• Computer-assisted translation.

• Writing aids: spelling, grammar, or style checkers, dictionaries of synonyms...

• Document retrieval.

I.1.c Industrial products

• Database querying:

- Yellow pages of the telephone directory (GSI-Erli)

- Q&A (Symantec)

- Clout

• Spelling checkers:

- Writer Workbench (ATT)

- Alpha (Borland)

- Sprint (Borland)

• Assisted translation:

- SysTran (Gachot)

- TAO (Alp Systems)

- Weidner

- Atlas (Fujitsu)

- Pivot (Nec)

• Document retrieval:

- Basis (Battelle)

- Spirit (Systex)

- Darwin (Cora)

I.2 The problem of understanding

I.2.a The three levels of analysis of written language

It is generally accepted that there are three levels of analysis for written language, namely:

• the syntactic level, which describes the rules for forming sentences and the possible
transformations of one structure into another (for example: passive vs. active, affirmative vs.
negative).

• the semantic level, which deals with the meaning of words and with the means of computing
the meaning of a sentence from the meaning of its words.

• the pragmatic or extra-linguistic level, which consists of knowledge of the surrounding world.

I.2.b Understanding

The aim of automatic language understanding is to transform a sentence into an internal
structure that makes explicit the relations between the different concepts of the sentence, given
that:

• the representation must facilitate the possible inferences,

• the degree of understanding depends on the intended objective,

• there is no isomorphism between the sentence and the internal structure.


I.2.c The main difficulties

We now draw up a kind of catalogue of the main difficulties associated with the automatic
analysis of language:

• Lexical ambiguities (homographs):

La petite brise la glace. ("The little girl breaks the ice" or "The light breeze chills her.")

La belle ferme la porte. ("The beautiful woman closes the door" or "The beautiful farm carries her.")

Time flies like an arrow.

• Polysemy:

voler (to fly through the air; to steal an object)

arm, beam

• Prepositional attachment:

Elle mange le poisson avec des arêtes. (She eats the fish with bones.)

Elle mange le poisson avec une fourchette. (She eats the fish with a fork.)

He saw the cat with a long tail.

He saw the cat with a telescope.

• Scope of conjunctions:

Les chats et les chiens sans queue.

Les poulets et les chiens sans queue.

Cats and dogs without a tail.

Chickens and dogs without a tail.

• Elliptical constructions:

Quelle est la capitale du Guatemala ? Du Mexique ?

Which is the capital of Guatemala? Of Mexico?

• Anaphora:

Le professeur a envoyé le cancre chez le censeur car : (The teacher sent the dunce to the
vice-principal because:)

1. il en avait marre (he was fed up)

2. il voulait le voir (he wanted to see him)

3. il lançait des boulettes (he was throwing paper pellets)

• Incomplete vocabulary:

- Neologisms.

- Authors' coinages.

• Fixed and semi-fixed expressions:

Prendre la poudre d'escampette (to take to one's heels)

à couteaux tirés (at daggers drawn)

raining cats and dogs

to have to see a man about a dog

• Identical syntactic structure with very different meanings:

Jean est difficile à convaincre.

Jean est habile à convaincre.

John is easy to persuade.

John is eager to persuade.

• Different syntactic structures with similar meanings:

Elle marche en souplesse.

Sa démarche est souple.

She walks in a relaxed way.

Her walk is relaxed.

• Implied information (understood by any human):

J'ai été dans trois librairies ce matin. (I was not in all three bookstores at once.)

I had been in three bookstores this morning when it began to rain.

• The nonexistence of a set of rules that would generate all the acceptable sentences of a
language and only those: much knowledge cannot be expressed in the form of rules (for
pronoun resolution, for example).

• Imperfection of the texts (typing, spelling, grammar, punctuation, and style errors).

• Lack of interest among pure linguists in real texts.
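The lexical ambiguity listed first can be made concrete with a small sketch. The lexicon and the two sentence patterns below are assumptions made up for this illustration (they are not taken from DARWIN or from any system named in the paper); the code simply enumerates every part-of-speech assignment that the toy lexicon allows and keeps those that match a pattern.

```python
from itertools import product

# Toy lexicon (an assumption for illustration): each word form maps to the
# parts of speech it can take.
LEXICON = {
    "la": {"DET", "PRON"},
    "petite": {"ADJ", "NOUN"},
    "brise": {"NOUN", "VERB"},
    "glace": {"NOUN", "VERB"},
}

# Two hypothetical sentence patterns standing in for a real grammar.
PATTERNS = {
    ("DET", "NOUN", "VERB", "DET", "NOUN"),  # subject, verb, object
    ("DET", "ADJ", "NOUN", "PRON", "VERB"),  # subject, clitic pronoun, verb
}


def readings(sentence):
    """Return every tag sequence allowed by both the lexicon and the patterns."""
    words = sentence.lower().rstrip(".").split()
    candidates = product(*(sorted(LEXICON[w]) for w in words))
    return [tags for tags in candidates if tags in PATTERNS]


if __name__ == "__main__":
    for tags in readings("La petite brise la glace."):
        print(tags)
```

Both patterns survive for "La petite brise la glace", so syntax alone cannot choose between the two readings; a real analyzer would need semantic or pragmatic knowledge to decide.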


|  II.  Les  bases  de  donates  constitutes  de  textes  integrauxl 


Dans  de  nombreux  domain es  (finance,  assurances,  administrations,  joumaux...)  les  donndes 
sont  presentees  sous  forme  de  textes  ecrits.  L'un  des  problfcmes  de  1'intelligence  artificielle  est  la 
representation  des  connaissances  et  la  conversion  de  ces  demi&res  en  des  formalisme  capables  d'etre 
interpretes  par  des  ordinateurs. 

Dans  la  suite,  nous  ne  nous  interesserons  pas  au  problfcme  de  la  comprehension  et  du  traitement 
des  textes.  Par  contre,  nous  allons  nous  attacher  tout  particuliferement  aux  methodes  k  mettre  en 
oeuvre  pour  etre  k  meme  de  pouvoir  retrouver  toutes  les  informations  contenues  dans  ces  textes. 

Deux  methodes  differentes  sont  utilisees  pour  la  recherche  documentaire:  les  systfemes  k 
thesaurus  (ex:  BASIS)  et  les  systfcmes  indexant  directement  le  texte  (DARWIN). 


L 
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II.  1  Les  svstfemes  k  thesaurus 

Ces  systkmK  ne  font  pas  intervenir  de  traitement  de  la  langue.  Pour  chaque  famille  de  textes,  un 
thesaurus  est  constitul  manuellement.  Les  textes  sont  indexes  sur  les  mots  contenus  dans  le 
thesaurus.  Cette  mlthode  est  facile  It  mettre  en  oeuvre  de  mankre  informatique.  Cependant: 

• it does not guarantee that all the information contained in the texts can be retrieved: if a
keyword is forgotten, the portions of text containing it can no longer be found. Building the
thesaurus is therefore a delicate step that requires a great deal of work and a thorough
knowledge of the texts to be indexed,

• updating the thesaurus (adding a new keyword, for example) means reindexing the whole
base, which can be a problem if the base is very large,

• modifying the document database may require the thesaurus to be updated (to take a new
domain into account, for example). The maintenance cost is therefore high,

• word-by-word indexing introduces noise into searches. For example, the French words "or"
(the noun "gold", but also the conjunction "now") and "car" (the noun "coach", but also the
conjunction "because") are a problem: if they are made keywords, they introduce considerable
noise, because every "or" and "car" in the text is indexed, even when it is a coordinating
conjunction. If they are not made keywords, on the other hand, the "silence" (relevant
passages that are missed) is likely to be significant. Similarly, homographs (e.g. la joue,
"the cheek" / il joue, "he plays") are not disambiguated. This noise is even more troublesome
in English, where almost every word is a noun/verb homograph. Here are a few examples of noise:

Or, il faisait chaud. ("Now, it was hot": or is a conjunction here, not the noun "gold".)

La petite Marie joue dehors. ("Little Marie is playing outside": joue is a verb, not the noun "cheek".)

Les poules de Marie couvent. ("Marie's hens are brooding": couvent is a verb, not the noun "convent".)

• a search for conseil général returns everything that relates to conseil and everything that
relates to général, which again introduces considerable noise. This problem is partly solved
in some systems by the notion of proximity: in our example, answers are returned only if
général directly follows conseil, which gives the right answers.
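The keyword indexing and the proximity refinement described above can be sketched as follows. This is a minimal illustration only, not the implementation of BASIS or of any other product: the function names, the sample documents and the three-word thesaurus are all assumptions made for the example.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

def build_index(docs, thesaurus):
    """Map each thesaurus keyword to the (doc_id, position) pairs where it occurs."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for pos, word in enumerate(tokenize(text)):
            if word in thesaurus:
                index[word].append((doc_id, pos))
    return index

def search_any(index, words):
    """Plain keyword search: documents containing at least one of the words."""
    return {doc_id for w in words for doc_id, _ in index.get(w, [])}

def search_adjacent(index, first, second):
    """Proximity search: documents where `second` directly follows `first`."""
    second_hits = set(index.get(second, []))
    return {doc_id for doc_id, pos in index.get(first, [])
            if (doc_id, pos + 1) in second_hits}

# Hypothetical sample base and thesaurus, invented for this sketch.
docs = {
    1: "Le conseil général a voté le budget.",
    2: "Le général demande conseil à ses officiers.",
    3: "Un conseil en or pour les débutants.",
    4: "Or, il faisait chaud.",
}
thesaurus = {"conseil", "général", "or"}
index = build_index(docs, thesaurus)

noisy = search_any(index, ["conseil", "général"])       # documents 1, 2 and 3
precise = search_adjacent(index, "conseil", "général")  # document 1 only
gold = search_any(index, ["or"])                        # 3 and 4: the conjunction is noise
```

The plain search returns three documents where only one is about the conseil général, while the adjacency test keeps that one. The conjunction "Or" in document 4 is still indexed as if it were the noun, which is the noise that proximity alone cannot remove.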


To ensure greater reliability and a lower cost, we see that it is necessary to do away with
building the thesaurus; this is what systems that index the texts directly achieve.

II.2 Systems that index the texts directly

DARWIN starts from a radically different approach: rather than building a thesaurus by hand,
the text is analysed so as to identify the concepts (extended noun phrases) it contains.
Indexing is done on these concepts. This indexing is based on a syntactic analysis supported
by linguistic rules. For example, the text:

Le petit chat est noir. ("The little cat is black.")

will be indexed on the concepts "petit chat" and "noir". This example is of course very simple,
and DARWIN is able to analyse any text whose spelling and syntax are correct.

This analysis is:

• automatic: it requires no particular intervention,

• independent of the domain in which one is working, since the only constraint is that the
texts obey the rules of grammar. This means in particular that there is no need to teach
DARWIN the meaningful expressions of the context (whether the texts are legal, technical,
literary...),

• faithful and complete: the expressions retained make up the document collection, as in
thesaurus-based systems. However, automatic indexing identifies all the objects without
exception and with very high reliability, whereas keyword systems are often not very
faithful and depend on the people doing the indexing:

-  noise due to noun/verb homographs is eliminated, since the linguistic analysis makes
it possible to detect verbs,

-  noise due to homographs involving grammatical words (such as or and car) is likewise
eliminated by the analysis of the text,

-  proximity problems disappear: a search for conseil général returns all the noun phrases
that actually contain conseil général, and not whatever relates to conseil and to général.

DARWIN allows texts to be organised as a tree of documents, sub-documents, etc., in order to
structure them.


II.2.a Building text bases

The following steps are necessary:

• entering the texts, either by typing them directly or with scanners,

• correcting the entered texts so that they obey the rules of grammar. Scanners generally
introduce errors (sur, for example),

• structuring the texts (organisation into documents, sub-documents...),

• indexing the structured texts.

i. Entering the texts

Three cases arise:

• the text already exists in electronic form; it may be in any format (Word, WordStar,
Sprint, WordPerfect...). It must therefore be converted to ASCII. This operation poses no
particular problem, since all word processors offer extended ASCII output.


• the text exists only on paper; it is digitised with a scanner. This operation requires the
text to be proofread, because some characters are often misrecognised. The most frequent
confusions are between i and 1, between 1 and I, and between 8 and g. Note that even if
there is only one error in the text, the whole text has to be reread.


• the text does not yet exist; it is then typed directly into DARWIN.


ii. Correcting the texts


The purpose of this step is to correct the residual errors in the text with the help of
spelling checkers. These errors are essentially of two types:


• keying errors and errors introduced by scanners: transposition of two characters, omission
of a character, mistyping of a character, doubled character. These are lexical errors.


• ill-formed sentences, which are not grammatically correct. These are syntactic errors.

1. Lexical errors

They are fairly easy to detect by comparing each word of the text with a lexicon. The more
entries the lexicon contains, the better the correction. However, the number of inflected forms
in a language (more than 700 000 for French) means that it is not possible to store them all in
a dictionary. For example, the L.A.D.L. dictionary of inflected forms (DELAF) occupies more
than 20 megabytes. The most widely used strategy is to store only the roots of all the words and
then to "compute" their inflections. This avoids having to store all the conjugations of the
verbs, or the gender and number agreements of nouns and adjectives. 47 entries are thus
saved for each verb in French. The algorithm used is as follows: the spelling checker reads
each word of the text. It then compares this word with those in its lexicon. If it finds no
direct match, it searches among the words it knows for those that are "closest". Some of the
methods used are:

-  phonetic correction: the unknown word is converted to a phonetic form. For example, the
word "habytation" is transformed into "abitassion". A number of rules are then used to
transform the word so as to match it with a known word. This method makes it easy to
correct all errors involving accents, endings in "tion", doubled letters (as in
"élégamment"), "ph", "y", etc.

-  morphological correction: the word is broken down into syllables, then rules substitute
one syllable for another,

-  correction by searching for the minimum distance to the known words; the word undergoes
successive substitutions that bring it "closer" to a known word,

-  other methods, which try to detect the most frequent errors: transposed characters,
doubled characters...
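The minimum-distance method in the list above can be illustrated with a short sketch. It uses the classic Levenshtein edit distance, which is an assumption on our part: the text does not say which distance real correctors use. The lexicon, the threshold `max_distance` and the function names are likewise invented for the example.

```python
def edit_distance(a, b):
    """Levenshtein distance: each insertion, deletion or substitution costs 1."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute ca by cb
        previous = current
    return previous[-1]

def suggest(word, lexicon, max_distance=2):
    """Return the known words closest to `word`, or [] if none is near enough."""
    if word in lexicon:
        return [word]
    if not lexicon:
        return []
    scored = [(edit_distance(word, known), known) for known in lexicon]
    best = min(distance for distance, _ in scored)
    if best > max_distance:
        return []
    return sorted(known for distance, known in scored if distance == best)

# Hypothetical five-word lexicon; a real one holds tens of thousands of entries.
lexicon = {"habitation", "habituation", "élégamment", "texte", "test"}

print(suggest("habytation", lexicon))  # the single closest known word
print(suggest("zzzzzz", lexicon))      # nothing within the threshold
```

Comparing the unknown word with every entry is acceptable for a toy lexicon but grows linearly with its size, which is one reason the text pairs a lexicon of roots with computed inflections rather than storing every form.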

Commercially available spelling checkers correct only lexical errors. Their limited number of
entries (a few tens of thousands of words) means that they rarely exceed a 95% success
rate.

2. Grammatical errors

The preceding checkers detect no error in a sentence such as:

Les belles jouent de Marie sont roses. (jouent is a verb form, "play", where a noun such as joues, "cheeks", is expected)

which is nevertheless incorrect. Only a syntactic analysis of the sentence makes it possible
to correct this kind of error. A syntactic parser is not suitable, however, because the
success of its analysis depends on the syntactic correctness of the sentence. Let us illustrate
this with the preceding example. Simplifying, a syntactic parser tries to analyse a sentence
in the form

Sentence -> Subject + Verb + Object
Subject -> Noun Phrase
Object -> Noun Phrase
Noun Phrase -> Article (+ Adjective) + Noun

Our parser will begin by looking for a subject, that is, a noun phrase made up of an article,
optionally an adjective, and a noun. It will therefore begin its analysis as follows:

Subject -> Les belles
Verb -> jouent

It will, however, fail when building the object, which would be de Marie sont roses, because
the verb jouer does not admit an object of the form de + something in our grammar. We see
clearly that the syntactic parser cannot detect the error on jouent. We ourselves are able to
detect it by understanding the meaning of the sentence, which lets us isolate the subject (Les
jouent de Marie), the verb (sont) and the object (roses). Having isolated each grammatical
structure, we can detect the error on jouent.

To our knowledge, no syntactic corrector currently exists.

iii. Structuring the texts

A text in a single block is unusable once it exceeds a certain size. This is why DARWIN
allows texts to be organised hierarchically into documents, sub-documents, etc.
This step is of course manual.

iv. Indexing the texts

Indexing is automatic. It can be done in two ways:

• immediate indexing, which is done document by document,

• bulk entry, which makes it possible to index a whole series of documents in one go. This
makes it possible, for example, to index a large text base overnight, without any manual
intervention.
II.2.b Querying text bases

The user can query the database directly using one or more words or expressions. DARWIN
answers in two stages:

• it gives a list of the expressions contained in the database that are close to the question
asked. The user accepts or modifies this list.

• once the list of accepted expressions has been validated, it proposes a list of the texts
found. They are presented in order of relevance, the most interesting appearing first.

Words or expressions can be truncated: for example, lingui* searches for all the concepts
beginning with lingui (the answers can therefore be linguiste, linguistes, linguistique...).
Among other things, this makes it possible to search without worrying about inflection
(singular/plural, masculine/feminine).

It is also possible to combine words and expressions logically with the operators et ("and")
and ou ("or"). The following queries are valid:

linguist* ou informatiqu* (to search for whatever deals with linguistics or with computing).

conseil regional et Bretagne (to search for whatever deals with the Breton conseil général).
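Truncation and the et / ou operators can be sketched over a small concept index. This is only an illustration of the query semantics described above, not DARWIN's actual query engine or syntax beyond what the text shows: the index contents, document numbers and function names are assumptions.

```python
def match_term(index, term):
    """Documents whose indexed concepts match `term`; a trailing * means prefix match."""
    if term.endswith("*"):
        prefix = term[:-1]
        hits = [docs for concept, docs in index.items() if concept.startswith(prefix)]
    else:
        hits = [index.get(term, set())]
    return set().union(*hits)

def query(index, left, operator, right):
    """Combine two terms with 'et' (intersection) or 'ou' (union)."""
    a, b = match_term(index, left), match_term(index, right)
    if operator == "et":
        return a & b
    if operator == "ou":
        return a | b
    raise ValueError(f"unknown operator: {operator!r}")

# Hypothetical concept index: concept -> set of document numbers.
index = {
    "linguiste": {1},
    "linguistique": {2, 3},
    "informatique": {3, 4},
    "conseil régional": {5, 6},
    "bretagne": {6, 7},
}

either = query(index, "lingui*", "ou", "informatiqu*")     # documents 1, 2, 3 and 4
both = query(index, "conseil régional", "et", "bretagne")  # document 6 only
```

Because the index keys are whole concepts rather than single words, the multi-word expression "conseil régional" is looked up as one unit, which mirrors the indexing on noun phrases described earlier.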


III. Examples of text bases using DARWIN

Here are a few examples of DARWIN applications:

• R.A.T.P. (Régie Autonome des Transports Parisiens): personnel management (on
VAX/VMS).

• OUEST-FRANCE: more than 1 GB of text (120 000 articles) on networked IBM-PC
compatibles,

• newspapers: LIBERATION, TELEPOCHE, LE MONDE INFORMATIQUE, LA CHARENTE
LIBRE...

• LA REDOUTE: fabric library, general documentation on IBM VM/CMS,

• GRÜND: editorial system for a dictionary of artists...

• LES MUTUELLES du MANS: agency regulations (1000 sites), information-centre
documentation on BULL DPS 6000.

• COURTAULD, FEDERAL GOVERNMENT OF CANADA: English version of
DARWIN,

• DER SPIEGEL: German version of DARWIN.
What I would like to do today is to give you a brief
overview of PROFILE Information, the process
involved in getting and delivering a full text source
of information online and the primarily economic
issues involved. The process is broken down into
collecting the data, designing and developing the
database, and delivering and selling the database.

PROFILE  INFORMATION 

PROFILE  Information  originated  as  a  full  text  news 
database  primarily  serving  the  media  industry. 
Media  users  include  the  major  international 
television  companies  (BBC,  ABC)  and  major 
newspapers  and  journalists.  PROFILE  now  holds 
the  full  text  of  all  the  UK  major  national 
newspapers  (FT,  Independent,  Guardian  etc), 
newswire  sources  such  as  Associated  Press,  and 
international  publications  such  as  The  Economist, 
Business  Week,  and  The  Washington  Post. 

PROFILE has since expanded its service, primarily
to serve the professional sectors (Management
Consultancy, Advertising, Marketing, and also the
Financial Analyst and Researcher market). This
has meant adding market research reports
(Euromonitor, Keynotes etc), Companies House
information on all 22 million limited UK
companies, and company news from a range of
European publications.

I have broken down the process of providing access
to  a  full  text  online  database  into  the  following 
sections: 

1. Choosing the relevant sources;

2. Collecting the data;

3. Designing and developing the database;

4. Delivering and selling the database.


In  general  terms  once  chosen,  the  sources  of 
information,  which  are  increasingly  in  an  electronic 
format,  are  transmitted  to  the  online  company. 
However,  if  data  is  not  already  in  electronic  format 
the  hardcopy  has  to  be  scanned.  This  is  then  run 
through  checking  programmes  and  then  through 
programmes  that  structure  the  data.  The  data  is 
then  loaded  and  indexed.  Delivery  to  the  customers 
is  primarily  through  users  dialling  up  via  local 
telephone  services  or  via  International  data 
communication  networks  such  as  Telenet.  Selling 
the  database  will  be  through  a  direct  sales  team, 
distributors,  such  as  Data  Arkiv  or  Aftenposten  or 
through  gateways  such  as  ESA  or  electronic  mail 
services.  In  addition  sales  are  generated  through 
advertising  and  trade  shows.  All  of  this  requires 
significant  investment  in  computer  power  and 
storage, and manpower. Nexis in the States has two
hundred salespeople. At PROFILE we would have
one third of the staff in Sales and Marketing, one
third in data preparation and one third in system
development.

Choosing  the  Relevant  Sources 

Online  databases  tend  to  conform  to  two  major 
types: 

1.  The  large  host,  such  as  Dialog,  carrying 
numerous,  diverse  sources,  and 

2. The niche database.

The first question therefore is what type of database
we want to be: are we, for example, amassing as
many sources as possible to become the generic
one-stop information source, or are we highly
focused? If one is targeted or niche oriented, how is
that niche defined, what are the job functions of our
target users, and what information should be
delivered, and in what form, to facilitate these
functions? Is there a base of information required,
for example, of company, market, product and
industry information with a leaning towards
specific vertical markets?

We  are  then  drawn  into  the  complex  debate  of  the 
value  of  information.  The  database  provider  has  to 
decide  whether  the  cost  of  hosting  the  data  can  be 
justified  by  either  demand  by  a  mass  market  or 
demand  by  a  niche  market.  The  latter  could  be 
through  a  more  "tailored"  service  enabling  a  higher 
margin. 

A company will create its own database or have a
private file on a bureau of internal company news
and yet use a database such as PROFILE to get
public news about its competitors and clients. The
latter  may  then  be  redistributed  alongside  the 
internal  corporate  information  via  internal  corporate 
systems  or  on  request  from  the  librarian.  For  a 
company  to  go  for  the  internal  solution,  that 
solution needs to be tailored and satisfy a
"concentrated  need".  However  an  online  host  may 
see  a  "concentrated  need"  that  could  be  satisfied  by 
"filleting"  the  database  and  delivering  that  sub  set  of 
information  via  an  SDI  service,  by  fax  or  E-mail  or 
perhaps  by  producing  a  CD-ROM  product.  The 
latter  may  be  facilitated  by  providing  a  vertical 
market  or  task  driven  interface. 

The  "host"  type  solution  may  imply  aiming  for 
sources  that  encourage  as  much  traffic  or  requests 
as  possible.  However,  the  most  profitable  situation 
occurs  where  the  information  is  of  high  value, 
storage  costs  are  low  and  transactions  are  few. 
Databases  have  gone  out  of  business  because  their 
computing  costs  increased  at  a  greater  level  than 
their  actual  margin.  More  sources  were  added 
involving  more  computing  costs. 

Collecting  the  Data 

The two main questions here are: is the data in an
electronic format, and what is the quality of the
data? Increasingly publishers are using "direct
input" computerised editorial systems, such as
ATEX or Norsk Data. This means that data can be
transmitted to the online host in a digitised format
containing "flags" or codes that can be used in the
loading programme. If the hard copy has to be
scanned using optical character recognition, costs are
significantly higher.

The quality of the data varies in terms of the
editorial quality and also the quality and structure
of the data. At PROFILE, for example, we employ
sub-editorial staff to write informative headlines, to
restructure tabular material, and to draw together
duplicated news wire stories. With regard to the
structure of the data, how extensive and how regular
is the news editorial system's coding? Are headlines,
author fields and end-of-text markers always apparent?
This in turn has a bearing on how sophisticated and
tailored the collection programmes have to be and
also how much manual intervention and checking is
required.

Designing  and  Developing  the  Database 

Initial questions about the database software may
concern its appropriateness for textual information
versus numeric information. What proportion of the
textual information is numeric, and what costs would
be involved in processing numeric information on a
primarily textual database? Certain software may
be expensive or CPU intensive to load but cheaper
to access if the indexing chosen addresses the needs
of the users. In general, the larger the database the
greater the processing costs, but this depends also on
the volume of transactions. Updating is also a key
issue: to update one huge database may mean
rebuilding the entire index. Therefore there is an
implied need to break the database up into multiple
databases with multiple indexes. If, for example,
you had one index and kept adding pointers to
other sub-indexes, the process would gradually get
slower and slower and increasingly expensive.

Also involved on the database design side of a full
text, free text search system is whether all words
really are searchable. If one could search for "stop
words" (and, or, of, the etc.) this would be
expensive in CPU terms. It may not be useful to be
able to search on common "banners" or disclaimers
occurring in every item, and excluding these could
save money. The degree of searchability is therefore
a trade-off between how useful the searchable fields
are to the user and how expensive they make
an individual search. Many online databases do not
adequately reflect users' needs in terms of
searchability  or  manipulation  of  the  data.  One  way 
round  this  is  to  produce  focused  PC  interface 
software  that  will  help  manipulate  the  data  on  the 
database  and  achieve  specific  goals.  This  avoids 
having  to  include  all  these  facilities  on  the 
mainframe  interface. 
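The saving from excluding stop words can be seen in a toy inverted index. The sketch below is illustrative only and does not describe PROFILE's software: the two sample items and the function names are assumptions, and the stop-word list is just the four words quoted above.

```python
from collections import defaultdict

STOP_WORDS = frozenset({"and", "or", "of", "the"})

def build_index(items, stop_words=frozenset()):
    """Map each searchable word to the set of item ids that contain it."""
    index = defaultdict(set)
    for item_id, text in items.items():
        for raw in text.lower().split():
            word = raw.strip(".,;:()\"'")
            if word and word not in stop_words:
                index[word].add(item_id)
    return index

def postings(index):
    """Total number of (word, item) entries the index has to store and scan."""
    return sum(len(ids) for ids in index.values())

# Hypothetical items, invented for this sketch.
items = {
    1: "The price of oil and the cost of storage.",
    2: "The cost of the database and the index.",
}

full = build_index(items)              # every word is searchable
lean = build_index(items, STOP_WORDS)  # stop words are not indexed

print(postings(full), postings(lean))
```

Even on two short sentences the stop words account for close to half of the entries, and a search on "the" would have to return every item, which is why making such words searchable is costly for little benefit.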

Delivering  and  Selling  the  Database 

When an online company is setting a price for the
online product, significant underlying ratios are
apparent. Computing costs can be as much as 30%,
royalty payments to the data providers could be
again 30%, and cost of sale, i.e. marketing,
promotion and salespeople, could be another 30%.
This leaves remarkably little profit margin and
results in a juggling act between these three factors.
It raises questions about how we can add value to
the data and how additional revenue can be brought
in from the same computing cost, i.e. how the
margins can be increased. For example, charging
for multiple use of the same information in an
organisation would increase the text vendor's
margin. Early investment and high costs may imply
a high degree of profitability in the future.
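The arithmetic behind these ratios is simple enough to write down. In the sketch below only the three 30% figures come from the talk; the function name and the 20% what-if figure are hypothetical, chosen purely to show how sensitive the margin is to any one of the three factors.

```python
def profit_share(computing_pct=30, royalty_pct=30, cost_of_sale_pct=30):
    """Percentage of revenue left once the three cost ratios are deducted."""
    return 100 - computing_pct - royalty_pct - cost_of_sale_pct

baseline = profit_share()                     # the three 30% ratios quoted above
what_if = profit_share(cost_of_sale_pct=20)   # hypothetical: cheaper selling via third parties

print(baseline, what_if)
```

With the quoted ratios only a tenth of revenue remains, so a ten-point change in any single ratio doubles the margin or removes it altogether.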

To  reduce  the  cost  of  sale  the  full  text  vendor  may 
decide  to  do  their  business  through  third  parties. 
They  may  be  distributors  or  gateways  through  other 
online  databases  or  other  means  of  delivery  such  as 
electronic  mail  companies.  Access  to  full  text 
databases  is  attractive  to  the  occasional  researcher 
who  may  use  electronic  mail  for  communication 
purposes.  The  negative  aspect  of  this  approach  is 
that  the  online  vendor  loses  control  over  their 
product  and  does  not  have  as  close  a  relationship 
with  their  third  party  customer  as  they  would  have 
with  a  direct  customer.  Usage  may  therefore  be 
limited  but  could  have  the  reward  of  broad 
penetration  of  markets  the  online  vendor  would  not 
otherwise  reach.  Distributors  also  need  constant 
management  to  maintain  their  commitment. 

With the direct customer, there is the proverbial
debate about who one is selling to, the end user or
the intermediary. Full text was envisaged as
appealing to the end user. However, usage is
probably still 60% by the intermediary or
information professional. This is primarily due to
three reasons: firstly, full text has been successful
and popular with the intermediary; secondly, there
is a lack of products that relate and are well
marketed to specific end users with specific job
functions and needs; and thirdly, there are technical
and psychological barriers to using an online service.

CONCLUSION 

Managing  full  text  online  databases  is  therefore  a 
juggling  act,  manipulating  a  variety  of  factors  such 
as  royalty  payments,  the  cost  of  sale,  revenue,  and 
the  underlying  equations  in  the  technical  area 
(speed  of  access,  size  of  storage,  computer 
processing  time,  updating).  Generally  there  is  a 
need  for  market  specialisation,  understanding 
specific job functions and related needs. These
needs will be satisfied by the online vendor through
a variety of means, for example by providing
information through various media such as facsimile
(FAX), electronic mail and CD-ROM. A variety of
channels  such  as  Corporate  Networks  and  internal 
information  systems  will  be  used  to  deliver  online 
information.  It  is  also  apparent  that  the  textual  and 
numeric  online  vendor  will  add  value  to  the 
information  they  store  through  effective  personal 
computer  (PC)  and  mainframe  interfaces  helping 
users  retrieve,  manipulate  and  distribute  the 
information. 
Abstract

The  recent  Portuguese  economic  developments 
are  described,  as  well  as  the  main  plans  that 
shape  the  future  developments  in  Portugal.  A 
brief  overview  of  the  industrial  structure  is 
provided  and  the  national  information 
infrastructure  is  analysed,  with  special 
reference  to  the  National  Library,  the  public 
libraries,  the  archives,  special  libraries  and 
information  systems,  telecommunications,  the 
information  industry  and  available  training 
programmes  in  Library  and  Information 
Science.  The  Programme  for  the  Development 
of  the  Information  System  for  Industry  is 
described  and  special  emphasis  is  given  to  the 
Post-Graduate  Course  for  Information 
Intermediaries, which aimed at preparing
qualified professionals to staff the information
units of the System. In order to evaluate the
impact of training provided through this Course
on job engagement and performance of these
information professionals, a survey was
conducted and the results obtained are
analysed. The conclusion stresses that,
considering the present state of development
of the Portuguese information infrastructure,
a major contribution to reduce the
communication gap is still to invest further in
education and training of information
professionals.


1 - INTRODUCTION - THE PORTUGUESE
ECONOMIC DEVELOPMENT

The first significant effort towards the
industrialisation of Portugal occurred only after
the Second World War. Following the economic
reconstruction movement of the European
countries, the development of Portuguese
society away from a rural and commercial
economy was then initiated. The bases of the
national industrial infrastructure were
established and are still the main support of the
present structure.

During  the  1960s  and  1970s  there  was  a 
progressive  modernisation  and  openness  to 
foreign markets. The electronics and mechanical
industries and the pharmaceuticals, textiles and
food sectors developed.

The five years before 1974 were years of
rapid economic growth for Portugal. Between
1968 and 1973, production increased by an
average of 7% per year.

As pointed out by the World Bank Report, "In
terms of the world economy and its impact on
the domestic situation, the Portuguese
revolution could not have found a time more
likely to complicate the adjustment and impede
future growth than April 1974. The oil price
increase  brought  with  it  a  substantial 
worsening  of  the  country's  terms  of  trade, 
while  the  downturn  in  Western  Europe  meant 
the  slackening  of  demand  for  Portugal's 
exports,  less  earnings  from  tourism  and 
perhaps  most  serious  of  all  the  levelling  off  of 
the  demand  for  Portuguese  workers"  (1). 

Between  1973  and  1977  the  GNP  per  capita 
was  virtually  stagnant.  The  flux  of  returnees 
from  ex-colonies  and  reduction  of  emigration 
increased  the  population  of  the  country  by 
10%. Important markets for Portuguese
products  were  lost  (Angola,  Mozambique). 

In  1976  there  was  a  slight  recovery  of  the 
economy,  measured  by  an  increase  of  8.6  %  of 
the  GNP.  This  tendency  continued  in  the  later 
years,  but  the  OECD  reported  in  1981  that,  at 
the  beginning  of  the  80s,  Portugal  was  in  many 
aspects  a  developing  country,  with  one  of  the 
lowest  GNP  of  the  OECD  countries  and  with  a 
production  capacity  largely  insufficient  to  meet 
the  demand  (2). 

In the 80s a successful stabilisation programme
within an agreement with the International
Monetary Fund (IMF) allowed the country to
move from a situation of 13.5% deficit in the
external current account in 1982 to
quasi-equilibrium in 1985.

The  increase  of  growth  permitted  the 
reduction  of  unemployment  from  8.6%  in  1984 
to  6.1  %  in  1988.  The  inflation  rate  in  the  same 
period  decreased  20  percentage  points,  and  a 
surplus  (overflow)  of  the  Balance  of  Current 
Transactions  allowed  the  repayment  of  part 
of  the  external  debt. 


2 - THE PLANNED FUTURE

The  medium-term  planning  strategy  for 
Portugal  is  defined  in  the  Main  Planning 
Options  1989/1992  (GOP)  adopted  by  the 
Government  (3).  The  overall  development 
strategy  of  the  GOP  relies  on  a  theoretical 
basis  that  considers  that  Portugal  will  be 
affected  by  the  dominant  evolution  tendencies 
of  the  internationalisation  of  the  world 
economy  and  the  process  of  European 
It is also assumed that a correct and efficient
utilisation  of  the  Structural  Funds  of  the  EEC, 
will  constitute  one  of  the  major  components  of 
the  global  development  strategy. 

The  GOP,  the  PCEDED  (Programme  for  the 
correction  of  the  structural  external  deficit 
and  unemployment),  and  the  Programmes 
within  the  European  Community  Aid 
Framework,  constitute  the  political  guidelines 
for  the  development  of  Portugal  in  the  near 
future. 

2.1 - PCEDED - Programme for the Correction
of the Structural External Deficit and
Unemployment

A  macroeconomic  strategy  for  Portugal  was 
set  up  in  1985  with  a  programme  called  "A 
strategy  of  controlled  progress".  This 
Programme  has  been  successively  revised  and 
adapted (1987, 1989) and is still a major
framework  to  understand  the  "Planned  Future" 
for  Portugal  (4). 

The  driving  forces  of  this  Programme  are  the 
modernisation  and  strengthening  of  the 
economy,  as  well  as  the  reduction  of  the 
external dependency. It is considered that the
structural adjustment of the Portuguese
economy and its preparation for the 1992 Single
Market require a macroeconomic policy
centred on three main areas:

i)  The  modernisation  and  increase  of 
productivity,  by  means  of  a  continuing 
investment  effort. 

ii)  The  reduction  of  inflation. 

iii)  The  reduction  of  the  financing  of  the  Public 
Sector. 

The aim is to reach 1992 with an inflation rate
at approximately the average of EEC
countries, and with an acceptable deficit of the
Balance of Current Transactions.

Within the framework of the planning document
PCEDED, several projects were adopted and
are being implemented within the social,
cultural and economic areas of activity of the
country.

2.2 - Industrial Structure

As a result of the historical and economic
development of Portuguese
society in the last 40 to 50 years, different
industrial capacities related to different types
of production have given rise to an industrial
structure covering various industries, with
technologies and companies of several generations.

Small and medium enterprises (SMEs) make
a significant contribution to the GAV (gross
added value) of the transformation industries,
ca. 45%. SMEs, including even the very small
enterprises with fewer than 6 employees,
represented 99.2% of all industrial
enterprises in 1988. Only 264 industrial
enterprises employed up to 500 workers and
about 86 had 1000 or more. The figures for
industrial exports also show that more than
50% of the country's exports are made by
SMEs (5).

The  major  difficulties  faced  by  these 
enterprises  and  by  the  industrial  enterprises  in 
general  can  be  summarised  as  follows: 

- few innovative entrepreneurs, as shown by
the low levels of investment in new areas;

- low productivity, as a result of
insufficient professional training, the
use of obsolete equipment and technology,
and the lack of adequate
management techniques;

- weak financial structures;

- very weak links with Universities or
Research Institutes;

- lack of technical, technological and market
information.


2.3 - PEDIP (Structural Programme for the

Taking into consideration the situation described
above, the Portuguese authorities defined
as the fundamental goal of the country's
industrial policy the preparation for the 1992
European Open Market.

One  of  the  major  operational  programmes 
within  the  Development  Plans  for  Portugal  is 
PEDIP (6), not only because of the relatively
high  amounts  involved,  but  also  because  of  the 
leading  role  of  industry  in  the  process  of 
modernisation  and  development  of  the  country. 

The  objectives  of  this  Programme,  approved 
by  the  European  Commission  and  funded  by 
the  Structural  Funds  of  the  EEC,  are: 

i) To revitalise the present industrial base by
giving financial support to existing
enterprises which have solid economic
structures;

ii)  To  develop  new  industries; 

iii) To eliminate or reduce the comparative
structural disadvantages by means of:
infrastructure support (creating basic
technological infrastructures and launching
professional training programmes); financial
support; and support to dynamic factors of
competition, such as improvement of quality,
industrial design and productivity missions.

Several funding schemes have already been
implemented and a great deal of improvement
is expected from this major Programme during
its four-year duration, until 1992.

3-  THE  NATIONAL  INFORMATION 
INFRASTRUCTURE 

The  definition  of  the  Information  Infrastructure 
concept  involves  some  difficulty.  For  GRAY 
(7),  infrastructure  is  a  term  to  avoid  when 
National  Information  Policy  is  concerned, 
because  it  means  different  things  to  different 
people: "Some equate it with the whole
organisational  structure  of  information 
services,  others  with  general  services  as 
distinct  from  the  specialised  services  that  are 
provided  within  areas  of  economic,  social  and 
cultural  development". 


ZURKOWSKY (8) and the American
Information Industry Association use the
neologism "Infostructure" to encompass all the
elements necessary to support the information
handling capacities of the country, the
objective of the infostructure being "to create,
communicate and deliver information useful to
all economic, social and political activities of
the country".

For the purpose of our study we consider the
Information Infrastructure as including the
following basic components:

- Information resources

- Organisations

- Information industry

i) By information resources is understood:

- the information sources of all kinds,
recorded knowledge in every physical
support;

- the human specialists devoted to
information-related activities;

- the technologies used as a means of
recording, delivering and communicating
information;

- the telecommunications devices that
constitute the channel for the delivery of
information.

ii) As organisations of the information
infrastructure we consider:

- the institutions traditionally devoted to
the activity of collecting, handling and
making available the recorded
knowledge: Libraries, Information and
Documentation Centres and Archives;

- the more recent organisational
environments created with the main
purpose of enhancing the diffusion and
utilisation of knowledge: Science Parks,
Innovation Centres, Technological
Centres.

iii) The information industry in this context is
restricted to information technology
manufacturing and information provision
activities, excluding the mass media
systems.

In order to draw an up-to-date picture of the
present Portuguese Information Infrastructure,
a survey was done as part of the work that is
being carried out by one of the authors at the
University of Sheffield. The major findings are
reported below.

3.1 - National Library


A Programme to reorganise the National
Library started in 1985. As a result of the
studies undertaken, a decision to create a
National Bibliographic Database (PORBASE)
was approved by the Government. The GEAC
system was chosen and the process of installing
the equipment started in 1987. PORBASE
is now available online with around 200 000
records, corresponding mainly to the National
Bibliography (1950-1990). A process of
retrospective conversion is under way.
PORBASE also includes records from the holdings
of other libraries all over the country.

Holding ca. two million volumes, the National
Library has developed its collection especially in
the fields of Arts and Humanities, while it has
a  rather  small  collection  of  Science  and 
Technology.  With  a  small  budget,  two  thirds  of 
which  goes  directly  to  pay  salaries,  the  NL 
has  no  acquisitions  policy.  The  collection  is  built 
mainly through the Legal Deposit (which
increases the collection by 10 to 12
thousand volumes per year). The lack of
trained human resources is considered by the
management to be the major problem of the
National Library. Of a total of around 300 staff,
only 33 are qualified Librarians and at present no
computer specialist is employed by the
Library, despite the equipment installed (a
mainframe computer, several micros and
peripherals). This shortage of qualified staff
prevents the library from opening, for example,
the Audio-visual section. There are also no
research staff, although the NL participates in
some international cooperative projects,
namely in the EEC Action Programme for
Libraries.

3.2 - Public Libraries

There is no tradition of regular use of Public
Libraries in Portugal. The public library
designation first appeared in the
name of the Real Biblioteca Pública da Corte,
created in 1796. Several other public libraries
were created in the 19th century, but mainly
because the extinction of the religious orders in
1834 resulted in a need to preserve
the bibliographic holdings of the monasteries
and colleges. With the advent of the Republic,
in 1910, Libraries started to be seen as
"instruments to destroy ignorance". A law
published in 1911 envisaged the creation of
"Popular Libraries" in every municipality. In
1919 a survey showed that there were
libraries in 68 municipalities, 12 of which were
still in an organisation phase and 37 of which had
fewer than 2000 volumes. In 1958 another survey
showed that of the 273 municipalities in the
country, only 66 had a functioning library (ca.
25%), while in 1982 (almost a quarter of a
century later) this proportion had increased
by only about 10 percentage points: of the total
of 275 municipalities, 97 had a library.
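
The proportions quoted for 1958 and 1982 can be checked with a short calculation. The sketch below is illustrative only: the function name coverage is ours and does not come from the surveys cited, and the inputs are simply the municipality counts given in the text.

```python
def coverage(with_library: int, total: int) -> float:
    """Share of municipalities with a functioning library, in percent."""
    return 100.0 * with_library / total


# Counts taken from the 1958 and 1982 surveys quoted in the text.
share_1958 = coverage(66, 273)
share_1982 = coverage(97, 275)

# Difference expressed in percentage points, not as a relative change.
change_points = share_1982 - share_1958

print(f"1958: {share_1958:.1f}%")
print(f"1982: {share_1982:.1f}%")
print(f"change: {change_points:.1f} percentage points")
```

With these counts the shares come out at roughly 24% and 35%, that is, an increase of about 11 percentage points, which is consistent with the rounded figures in the text.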

In general the municipalities with libraries are
those of the more developed areas of the
country, and the common features of these
libraries are small and/or old collections,
obsolete equipment and no qualified staff.

Considering this scenario, the Secretary of
State of Culture set up, in 1986, a Task Force
with the objective of preparing the basis for a
"coordinated intervention in the Public
Libraries" (9).

Following the measures proposed by the Task
Force, a programme to establish a Public
Libraries Network is now in operation. This
Programme is based on cooperative action
between the Local Authority (at municipality
level) and the Instituto Português do Livro e da
Leitura - IPLL (the national coordinating body)
and aims to cover one third of all the
municipalities during its five-year duration. It
is assumed that by that time "a point of no
return" will be achieved and the imitation
effect will reach the rest of the country. Within
the Programme, full support is offered to the
Local Authorities that declare an interest in
creating or improving their public library,
provided that some basic conditions are
accepted:

-  50%  of  the  total  costs  involved  must  be 
supported by the Local Authorities;

-  the  library  must  follow  the  technical 
specifications  decided  by  IPLL, 
concerning  spaces,  technical  procedures 
(e.g.  lending  services  and  open  access), 
and  employ  qualified  staff. 

3.3  -  Archives 

The  preservation  of  archive  material  is  of 
considerable  importance  not  only  for  historical 
research  purposes,  but  also  as  an  instrument 
of development and progress. The
interconnection between the resources of the past
and  the  concrete  situations  of  the  present  is  a 
factor  to  take  into  consideration  when 
planning  the  future. 

One  of  the  actions  included  in  the  GOP 
document mentioned in section 2, intended to
contribute  to  the  "Promotion  of  the  National 
Identity  Values",  is  the  development  of  the 
National  Archives  Network. 

The  National  Institute  of  Archives  was 
created  in  1988  having  as  its  mission  the 
establishment  of  a  network  of  all  archives 
collections,  on  a  district  basis. 

The  first  step  to  build  the  Network  is  the 
implementation  of  standardised  procedures.  A 
software  application  using  the  UNESCO 
CDS-ISIS  system  was  tailored  specifically  for 
the  Network  and  is  in  the  implementation 
phase. This will help all the archives to follow
the same methods of coding and storing
materials. Microcomputers were supplied
and basic computer training provided. Again,
the lack of trained human resources is the
major constraint.

3.4 - Special Libraries and Information
Systems

The special Libraries and Information /
Documentation centres existing in public
and private organisations are in general the
most advanced information systems in the
country. The development of information
services specially tailored to meet the specific
information needs of particular user groups
occurred first in the Scientific and Technical
Information (STI) area. Although other information
services have been developed in recent times,
the STI field is still the one with the widest
national coverage.

In  the  Ministry  of  Industry  and  Energy  (MIE), 
several  specialised  organisations  set  up  their 
own  systems  and  started  creating  in-house 
databases. 

In general, the information centres are created
to meet the needs of the organisation's internal
users, but later develop the more extensive
purpose of also providing information to
external users. This was the case of the
National Laboratory of Engineering and
Industrial Technology (LNETI). The Centre for
Scientific and Technical Information for
Industry (CITI) of LNETI, whose mission is
mainly to satisfy the information needs of the
research and technical staff, developed in 1985
the first bibliographic database in the field of
industry and energy available online in
Portugal. This database is the online
catalogue of the CITI libraries; it also aims to
cover all the literature produced by
Portuguese researchers in the field of industry
and energy, and is available outside LNETI.

We cite the example of LNETI, but we know
that more or less developed and technologically
equipped information Centres exist in the
various Ministries, mainly concentrated in
the Central Administration Departments.

3.5 - Telecommunications

In the past 5 years, a great effort has been made
in Portugal to modernise the telecommunications
system and to introduce new technologies.


The digital network system has been gradually
implemented and by the end of 1993 is
expected to cover the whole country,
replacing the analogue system. A project
to implement the ISDN is planned in three
phases:

- the first phase, called pre-ISDN, from
1989 to 1990, offering services such as
videoconference, telephone with
sophisticated facilities, high speed access
to VIDEOTEXT and MHS (Message Handling
Systems) networks, as well as
virtual networks for private systems;

- the second phase, from 1990 to 1991, will
provide 1st generation ISDN services such as
highly sophisticated telephone, digital
access to Telepac (PPSN), Videotext and
MHS, intercommunication between
workstations and PCs, and data
communications at 64 Kb/s;

- the third phase, from 1992 onwards, will
provide 2nd generation ISDN services such as
videotelephone, videoconference, digital
portable telephone and access to all services
at 64 Kb/s (10).

The Public Packet Switched Network
(TELEPAC) became commercially available in
1985 with 3 switch nodes in Lisbon, Porto and
Coimbra, and by the end of 1989, 27 nodes
had been installed covering the continent and the
adjacent islands (Madeira and Azores).

According to international indicators the level
of utilisation of data telecommunication in
Portugal is still relatively low; however, it
tripled from 1985 to 1987 and the annual
growth rate is greater than the average of
EEC countries.

The Videotext system and the POS (Point of
Sale) and ATM (Automated Teller
Machine) services are the main causes of the
increase in the volume of data transfer. The
Videotext system started in 1989 and was
primarily oriented to the professional market.
In February 1990, the use of the system was
measured at 4.604 hours of connection time. The
forecast for the near future is rapid growth,
mainly because of penetration of the
residential market (11).

3.6 - Information Industry

A recent report prepared for the EEC, within
the IMPACT Programme (12), states that the
information industry and information services
market in Portugal are still in an incipient
phase of development. Information-related
activities were, until recently, identified only
with Libraries and STI Centres. Several
initiatives from other areas are enlarging the
borders of the sector, but a lack of identity among
the different players in the field, who do not
see themselves as being involved in the same
business, is one of the reasons for the reduced
size of the information industry in the country.
However, it is reiterated that great potential
exists in some areas.

The survey shows that Tourism,
Environment, Services to Local Communities and
Geographic Information Systems are the
areas expected to make a major contribution to the
development of the information market in the
short term. This conclusion took
into consideration the evaluation of the
projects already under way, the resources
allocated to them, as well as the potential
consumer market.

The  report  also  emphasises  that  the  lack  of 
qualified  human  resources  is  a  major 
constraint  to  the  development  of  the 
information  industry  and  information  services 
market  in  Portugal. 

3.7 - Training Programmes

The officially recognised training programmes
in Information were conceived as
specialisations to be built on top of other
academic qualifications rather than as a scientific
field in its own right. Therefore, no graduate
programme is available in this area. Training
programmes take the form either of
professionally oriented courses on top of
secondary education, or of post-graduation
courses open to graduates of any background.


Professionally oriented courses are taught in
some Secondary Schools in Coimbra, Lisboa
and Porto within the area of Humanities and
convey specialised training in
Information Technology, Sociology of
Information, Management and Information
Techniques. Two branches are available, one
for preparing Library professionals and the other
for preparing Archive professionals, both at
assistant level. Before the recent integration
of these curricula into the National Educational
System (1989), these courses were promoted
and supported by BAD - Associação
Portuguesa de Bibliotecários, Arquivistas e
Documentalistas, one of the Portuguese
professional Associations.

The Post-Graduation Courses in Librarianship
were created in 1983 to replace the
former post-graduation course, which had run
since 1935 with a rather conservative
curriculum aimed mainly at the training
of archive specialists. These two-year courses
are run by the Universities of Coimbra, Lisboa
and Porto. The first and second semesters
consist of a joint curriculum for both
librarians and archivists, while the third and
fourth semesters follow separate
specialised curricula for the two
branches. The core disciplines are
Management and Informatics (Introduction to
Computer Science) for both branches, Subject
Indexing and Cataloguing (for Librarians) and
Archivology and Paleography (for Archivists).
A rather strict numerus clausus does not allow
the enrolment of more than twenty students
for the librarians option and ten for the
archivists, annually.

4- THE PROGRAMME FOR THE
DEVELOPMENT  OF  THE  INFORMATION 
SYSTEM  FOR  INDUSTRY 

In the previous paragraphs the main points of
the movement towards the modernisation of
Portuguese society were outlined. A
chronological picture of the economic
development experienced in the last four or
five decades and a general overview of the
present information infrastructure were given.

The Programme for the Development of the
Information System for Industry, undertaken in
1987 under the auspices of the Ministry of
Industry and Energy (MIE), must be analysed
in the context of the above framework. This
experimental Programme has already been
described in detail elsewhere (13,14). Briefly,
the Programme aimed to create Information
Nodes in several Industrial Associations - IA
(the Portuguese equivalent of Chambers of
Commerce). The Information Nodes were to be
staffed by two qualified Information
Intermediaries and furnished with a
microcomputer linked to the PPSN, fax and
telex facilities and a basic reference
collection. Financial and technical support were
to be given by CEDINTEC - Centre for
Technological Development and Innovation (an
agency for the encouragement of technical
innovation in industry), and by
LNETI/CITI, respectively.

An Information System is essential to the
overall project to modernise industry, and
it is always mentioned as a priority in policy
documents. In the MIE, several other projects
to develop such a system were proposed but, for
various reasons, failed.

We think that the uniqueness of this
Programme lies in the three prerequisites
considered the key factors for success:

- to create a light and flexible structure,
with the technological facilities to allow
access to world-wide information
sources;

- to involve the potential users in the
project;

- to rely heavily upon the ability of the
Information Intermediary to bridge the
communication gap between the
information sources and the end user.

The participation of the IA proved to be a
useful means of bringing the system closer to
users, and the configuration designed for the
Information Nodes seems to have been the
most suitable, if we consider the funds
available for a three-year Programme.


The Information Intermediary was seen as a key
player in the system. Considering the
Portuguese industrial structure, the system
was intended to meet the information needs of
small and medium enterprises, which were
described above (see 2.2), in the different
regions of the country.

Promoting the effective transfer of
information to this category of users requires
specific qualifications, such as the ability to:

- seek, identify, select, process and
present information in a form adapted to
the specific needs of users in industry;

- guarantee the efficient transfer of
information from all sources available in
the country and abroad;

- identify/survey the information needs
resulting from the activity of industrial
firms;

- contribute towards the identification of
existing information resources at
national level which can satisfy the
information needs detected.

Given the lack of training opportunities for
information specialists, as seen above (see
3.7), it was decided to organise a training
programme specifically aimed at preparing the
Intermediaries who would staff the Nodes of
the System.

5-  THE  POST-GRADUATION  COURSE 
FOR  INFORMATION  INTERMEDIARIES 

The  training  programme  developed  by 
LNETI/CITI  in  collaboration  with  the 
Department  of  Information  Studies  of  The 
University  of  Sheffield  has  already  been 
analysed on various occasions (15).

Because the impact of the course goes far
beyond the initial objectives, and for reasons of
clarity, the main lines of the course are
described again.

The course is structured in two terms with a
total of 750 teaching hours, theoretical,
practical and tutorial. The broad areas of the
curriculum are:

i) Information in its economic, political, social
and cultural context;

ii) Information processing and application of
information technology in developing and
implementing information systems;

iii) Information resources;

iv) Information transfer and technology
transfer.

A very wide range of topics was covered by
the 13 modules of the course, as outlined in
Table 1.


Table 1: Modules of Term 1 and Term 2

- Intensive course in English
- Information Science
- Information Marketing
- Information Processing and Microcomp. Applications
- Information Resources
- Technology Transfer
- Expert Systems
- Applic. of Microcomp. in Info. Management
- Business Information
- Info. Management
- National Info. Resources
- Research Methods
- Info. Needs and design of Info. Services

An evaluation was carried out after the first
course, held between October 1987 and April
1988, and a number of changes to the
programme were recommended (16). Some of
the suggestions were implemented in the next
two courses: for instance, the presence, in the
first month of the course, of representatives
of the information providers and information
users from business and industry to discuss
the problems of information provision and use;
and the introduction of a new module
on the business scene in Portugal,
covering how Portuguese
business and industry operate in general
terms; the different types of enterprises and
the need for change towards 1992 and the
open market; and the main lines of economic and
social policy as they affect enterprises.


From the point of view of the course
organisers, the above-mentioned aspects
relating to the business scene, and the field
work done as part of the National Information
Resources module, are of considerable value for the
future professional accomplishment of the
Intermediaries. The same applies to the
training given in the communication aspects
involved in the information transfer process.

6-  IMPACT  OF  TRAINING  ON  JOB 
ENGAGEMENT  AND  PERFORMANCE 

6.1 - The Questionnaire

In order to evaluate the actual impact of the
training provided through the Post-Graduation
Course for Information Intermediaries on job
engagement and performance, a questionnaire
was prepared. It contained 5 sections with a
total of 21 questions, covering items such as
the current professional situation, mode of
recruitment, main activities performed,
utilisation of technology and perceived
contribution of specific skills acquired or
developed throughout the course to the
professional performance of the former
students.

The questionnaire was sent by mail to all the
35 students who had attended the first and
second courses in the academic years
1987-1988 and 1988-1989.

Although most of the questions were
closed questions, those concerning the
competencies (understood as a body of
knowledge, skills and attitudes) required for the
fulfilment of the job and those concerning the
main activities performed were open-ended, in
order to provide as much information as
possible on those items.


The questionnaire was first tested with the
collaboration of two former students. The
questionnaires were issued in March 1990 and
21 responses were received, corresponding to
60% of the total number issued.
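
The response rate follows directly from the numbers of questionnaires issued and returned. A minimal sketch of the arithmetic is given below; the function name response_rate is hypothetical and the figures are those reported above.

```python
def response_rate(returned: int, issued: int) -> float:
    """Percentage of issued questionnaires that were returned."""
    if issued <= 0:
        raise ValueError("issued must be a positive number")
    return 100.0 * returned / issued


# 35 questionnaires were mailed and 21 responses were received.
rate = response_rate(returned=21, issued=35)
print(f"Response rate: {rate:.0f}%")
```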

6.2 - Analysis of the Responses

6.2.1 - Professional Situation

Only one of the respondents is not working in
the field, due to military service obligations.
The other 20 are distributed in a fairly
balanced way over the private and public
sectors and earn salaries from Esc.
70.000$00 to 300.000$00 (see Table 2). Both
private and public sector jobs are based
heavily on the services sector. 40% of the
respondents are on their second job since
finishing the course, and 35% of this
group are performing the job of Information
Intermediaries.


Table 2: Job Sector and Wages

Wages                   publ.   priv.   coop
(Portuguese Escudos)      %       %       %

200.000/299.000          10       -
150.000/199.000           5      10       -
100.000/149.000          35      20       -
 70.000/ 99.000           5      10       5

6.2.2 - Recruitment

The mode of recruitment used was mainly
direct invitation, 40% (see Table 3), and
interview was the most common method of
selection, 70%. When asked if they had
mentioned their post-graduation course during
the selection process, 90% answered
affirmatively and, among these, 70% think
that it was decisive for their final recruitment.

Table 3: Recruitment

Advert              30%
Invitation          40%
By indication of    30%

6.2.3 - Competencies Required

70%  of  the  respondents  mention  as 
competencies  required  by  employers  for  the 
fulfilment  of  the  job,  a  graduation  plus 
additional  specific  skills  and/or  attitudes.  30% 
mention  a  graduation  plus  specifically  the 
Post-Graduation  Course  for  Information 
Intermediaries  without  further  requirements. 


This leads us to believe that the course was
regarded by employers as containing the full
range of knowledge, skills and attitudes
required for the job they were offering. Table 5
shows the skills and attitudes
required by employers as a supplement to the
graduation: computer skills and
interpersonal communication skills are the
most frequent requirements.


Table 5: Competencies

Skills           %     Attitudes              %

Computing       45     Ability to adapt      10
Foreign lang.   10     Ability to commun.    25
Telecommun.      5     Ability to innovate   10
Leadership       5     Team spirit           10

6.2.4 - Main Activities Performed

Among the activities listed in Table 6 (several
answers were allowed), retrieving information
for one's own utilisation is mentioned by 90%
of the respondents, and participation in
meetings and report writing by
80%; retrieving information to provide to
someone else and contacting people outside the
organisation come next (70%). However, when
asked which activities were most frequently
performed, retrieving information to provide to
someone else comes first, followed by
answering information requests. The remaining
activities mentioned by the respondents are
highly scattered, according to the specific
requirements of the different organisations
where the activities are performed.

Table 6: Activities performed

Contacting people inside the Org.         55%
Contacting people outside the Org.        75%
Answering info. requests                  70%
Retrieving info. to users                 75%
Retrieving info. for one's utilisation    90%
Index., Catalog., Class., Abstract.       10%
Storing information                       55%
Disseminating info.                       65%
Meetings                                  80%
Writing reports                           80%

6.2.5 - Utilisation of Technology

Table 7 shows the level of use of microcomputer packages in
the activities of the Intermediaries, with high rates for
DBMS and word processors (80%) and spreadsheets (70%).
Table 8 displays the means used daily to communicate inside
and outside the organisation, with heavy reliance on
personal contact inside the organisation (90%) and on the
telephone both inside and outside it (60% and 65%). Fax
machines are relatively little used (25%), and electronic
mail is barely used as a means of communication either
inside or outside the organisation (5% and 10%).


Table  7:  Microcomputer  packages 

Table 8: Means of communication used inside/outside the organisation

                    IN     OUT
Personal contact    90%    20%
Telephone           60%    65%
Mail                20%    20%
Telex               -      -
Fax                 25%    25%
E-mail              5%     10%

6.2.6 - Perceived contribution of specific skills
acquired throughout the course

Asked whether and how much the course as a whole has
contributed to their professional performance, 25% of the
respondents state that the course is extremely useful and
30% that it is very useful (55% in total), while 5% think
it is useless for the job they are now performing.


A detailed analysis of the contribution made by specific
skills acquired or developed during the course (see Table
9) shows that 9 of the 11 skills listed were rated
extremely or very useful for current job performance by
50% to 80% of the respondents. The skills most valued are
the technological skills, followed by knowledge of
information sources.


Table 9: Perceived contribution of specific skills
acquired through the course

                          Level of contribution
Skills                    1    2    3    4    5    4+5     %
Technology                -    3    -    6   11     17    85%
Info. sources             -    3    1    7    9     16    80%
Online searching          2    3    5    3    7     10    50%
Info. for industry        2    2    3    5    8     13    65%
Portuguese industry       1    3    6    5    5     10    50%
Organisation of info.     -    4    4    7    5     12    60%
Presentation of info.     -    1    5    7    7     14    70%
Info. needs               -    3    2    7    8     15    75%
English language          -    3    3    8    6     14    70%
Communication skills      -    2    4    4   10     14    70%
Technology transfer       2    3    7    7    1      8    40%
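
As an illustration only, the last two columns of Table 9 can be recomputed from the five response counts in each row: the "4 + 5" column is the number of respondents choosing level 4 or 5, and the percentage is that number divided by the total number of answers in the row. The short Python sketch below does this for three rows of the table. The function name `top_two_share`, the variable names and the reading of "-" as zero responses are assumptions made for the example, not part of the survey.

```python
# Illustrative sketch: recompute the "4 + 5" count and percentage
# for a few rows of Table 9. Counts are copied from the table;
# a "-" cell is assumed to mean zero responses.

def top_two_share(counts):
    """Return (n_top, percent): respondents choosing level 4 or 5,
    and their share of all answers in the row, rounded to a whole percent."""
    total = sum(counts)
    n_top = counts[3] + counts[4]          # levels 4 and 5
    return n_top, round(100 * n_top / total)

# Response counts for levels 1..5 (hypothetical variable name).
rows = {
    "Technology": [0, 3, 0, 6, 11],
    "Info. sources": [0, 3, 1, 7, 9],
    "Technology transfer": [2, 3, 7, 7, 1],
}

results = {skill: top_two_share(counts) for skill, counts in rows.items()}

for skill, (n_top, percent) in results.items():
    print(f"{skill}: {n_top} of {sum(rows[skill])} -> {percent}%")
```

Each of these rows sums to 20 answers, so every respondent counts for 5 percentage points.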

7 - CONCLUSIONS

The outline given in sections 1 and 2 provides an overall
view of the level of economic development in Portugal and
of its industrial structure, while section 3 analyses the
Portuguese information infrastructure. Together they are
the basis for understanding the Programme for the
Development of the Information System for Industry and the
Post-Graduation Course for Information Intermediaries. This
background, together with the results of the survey of
former students, allows us to formulate the following
conclusions:

i) The information infrastructure in Portugal
suffers from a severe shortage of
qualified information professionals. This is
confirmed by the high rate of job mobility
among the recently graduated information
intermediaries, who look for better jobs and
wages and succeed in getting them.

ii) The Post-Graduation Course for
Information Intermediaries has contributed
both to increasing the number of qualified
information professionals trained
annually and, through the innovative
features of its curriculum, to enlarging the
range of information competencies available
at a national level. In general, former
students see the course as an important
additional specialisation that enables them
to perform information-related activities in
any professional environment with a high
level of achievement.

iii)  As  would  be  expected  from  an  incipient 
information  industry  and  information 
services  market,  the  use  of  information 
technology  is  also  incipient,  even  though 
technological  skills  are  highly  appreciated 
both  by  employers  and  employees. 

iv) Interpersonal communication skills are also
highly valued and sought by employers, and
personal contact proves to be one of the most
frequent means of information transfer.


Considering the present state of development of the
Portuguese information infrastructure, we think that a
major contribution to reducing the communication gap is
still further investment in education and professional
training. The technology is becoming available and is
gradually being used throughout the country, and the
information industry is taking its first steps but, as
widely recognised by all the key players in the information
scene in Portugal, the lack of specialised human resources
is the main constraint on the development of the sector in
particular and of the country in general.

However, the experience of the Programme for the
Development of the Information System for Industry, and in
particular of the Post-Graduation Course for Information
Intermediaries, shows that developing a light and flexible
structure integrating technologies and qualified human
resources is probably the most effective way, at present,
to allow Portuguese industrial firms to access the
information available world-wide on the same terms as their
foreign competitors.

1. INTRODUCTION

This  paper  was  prepared  by  the  Theme  Coordinator  after  he  had 
read  the  other  contributions.  Sections  2  and  4  were  written 
in  French  and  Section  3  in  English.  The  text  presented  here 
consists  of  English  and  French  versions  of  the  complete  text, 
each  including  some  translations  from  the  other. 


2.  AN  INFORMATION  NETWORK 

The  purpose  of  this  paper  is  to  draw  out  the  many  conclusions 
arising  from  the  other  papers  presented  at  the  meeting  and, 
above  all,  to  identify  the  tasks  ahead  for  information  centre 
managers  and  recommendations  that  they  could  make  to  their 
superiors  or  to  Government  departments  with  responsibility  for 
information  policy,  at  least  in  those  countries  where  there  is 
a  national  information  policy.  Where  there  is  no  such  policy, 
concerted  action  should  be  taken  to  persuade  a  ministry  or 
other  official  body  to  define  one,  based  on  the 
recommendations  arising  from  this  meeting. 

It  is  of  course  a  very  ambitious  project,  but  if  all 
information  centre  managers  wish  to  be  able  to  respond  fully 
to  the  desires  and  needs  of  their  end-users  (who  attach  more 
and  more  importance  to  relevance  of  information  and 
particularly to its speedy delivery in their own language),
it will be essential to develop and provide for their use all
the necessary means (logistic, personnel, financial and
training) to enable them to exploit fully and effectively the
new  technologies  which  will  soon  be  used  in  research  and  for 
information  transfer. 

It  is  obvious  that  not  all  the  recommendations  can  be  applied 
in  all  countries,  or  even  in  all  information  centres,  but  they 
should  be  considered  as  guidelines  for  the  definition  of 
information  policies  in  each  country  and  in  each  information 
centre.  We  live  now  in  a  world  which  has  a  permanent  need  for 
information.  This  means  that  we  should  all  use  compatible 
methods and means for the transfer of information. Only if we
do so can we bridge the existing communication gaps.

To  do  so,  we  must  first  of  all  have  in  mind,  or  be  able  to 
define,  guidelines  for  the  organisation  of  an  information 
network, in which each participant should be considered both
as  a  supplier  and  a  user  of  information.  Such  a  network  can 
operate  only  on  the  principle  of  reciprocity  in  which  the 
members  of  the  network  all  agree  to  share  the  tasks.  In  this 
way,  each  member  can  have  quick  access  to  a  mass  of 
information  at  low  cost.  But  that  is  possible  only  if  the 
operations  of  the  members  of  the  network  are  fully  compatible 
with  one  another. 


SCHEMATIC  REPRESENTATION  OF  AN  INFORMATION  NETWORK 


This  type  of  network  can  be  adopted  either  at  a  national  level 
or  within  an  organisation  or  a  company.  It  is  a  multiple 
communication  network  which  must  be  totally  transparent  to  the 
end-user. 

Based on this plan, and on the papers presented during this
meeting,  the  main  recommendations  can  be  classified  under  the 
following  headings,  each  of  which  is  elaborated  below: 

1. Logistic support

2. Standardisation

3. Artificial Intelligence

4. Assistance to information suppliers/users

5. Knowledge base


3. RECOMMENDATIONS
3.1 Logistic Support
Organisational

Creating a coordinating body responsible for the
definition of a documentation policy and its
implementation;

Defining and organising an information network.

Communication Networks

Developing and introducing ISDN networks;

Developing and introducing the use of optical fibres;

Planning different types of communication networks:
centralised, decentralised and with multiple connections.

Computer Systems

Giving priority to a high-performance and user-friendly
electronic storage system connected to the existing
computer application environment (in most cases, PC
terminals can be connected to large host systems).

3.2 Standardisation

Using international standards for the acquisition of texts and
pictures in computer systems (from the writer to the editor of
such information).

Harmonising the procedures for access to the various information
service centres, in consultation with the system designers and
users.

3.3  Artificial  Intelligence 

Studying and developing automatic systems for indexing or
abstracting (concept extraction).

Studying  and  developing  connection  tools  to  improve  access  to 
service  centres  and  intelligent,  multilingual  and  user- 
friendly  databases. 

3.4 Assistance to Information Suppliers/Users

Developing computer-assisted writing systems (to deal with
problems such as badly written texts, ambiguous sentences,
spell checking, etc.).

Creating  multilingual  terminology  databanks  whose 
applicability  is  to  be  checked  by  user  countries. 

Developing  Computer  Assisted  Translation  (CAT)  systems. 

Improving  CAT  workbenches. 

Training  the  potential  users  of  new  technologies. 

3.5  Knowledge  Base 

Organising and structuring the selection and acquisition of
past, current and future information to be recorded in a
national databank (the know-how/grey matter of a country), while
preserving the confidentiality of the information put into
circulation.

Defining  the  structure  and  contents  of  the  databases. 


4.  CONCLUSION 

In outline, those are the main recommendations that we can make
to  the  competent  authorities.  It  is  difficult  to  specify  an 
order  of  priority,  but  if  we  want  to  bridge  the  communication 
gap  in  all  its  forms,  these  recommendations  must  be  developed 
and  used  very  quickly.  However,  when  we  examine  the  list  it 
is  clear  that  some  actions  have  already  been  started  in  all 
the  areas  mentioned,  with  proposals  for  funding  which  should, 
in  most  cases,  allow  the  completion  and  marketing  of  all  the 
projects  under  consideration. 

Why,  then,  should  we  make  any  recommendations? 

1.  Because  users  should  work  with  system  designers,  in 
order  to  make  their  needs  known. 

2.  Because  all  the  systems  developed  should  be  usable 
within  the  existing  office  automation  environment,  should 
be  compatible  among  themselves  and  should  be  able  to  form 
part  of  a  general  information  network.  Hence  the 
importance  of  coordination  by  an  official  organisation. 

3. In order to convince our authorities of the need for
an official coordinating organisation charged with
defining an information policy and initiating it.

Very  briefly,  then,  these  are  the  reasons  which  lead  us  to 
make  recommendations  that  can  be  addressed  to  the  appropriate 
authorities  and  to  take  actions,  particularly  to  ensure  the 
training  of  the  personnel  who  will  have  to  use  these  new 
technologies. 

PROJETS DE RECOMMANDATIONS A FAIRE AUPRES DES MINISTERES ET AUPRES D'AUTRES
AUTORITES SUR LA BASE DES COMMUNICATIONS PRESENTEES LORS DE LA CONFERENCE

par

OLEG LAVROFF
5 rue Anna Jacquin
92100 Boulogne
France

(Chef, retraité, de la Section Information d'Aérospatiale, Suresnes, France)


1. INTRODUCTION

Le coordonnateur du thème de la conférence a rédigé cette note après avoir lu les
interventions des autres conférenciers. Les sections 2 et 4 ont été rédigées en
français, et la section 3 en anglais.

2. UN RESEAU D'INFORMATIONS

L'objet de cette intervention est de dégager les principales conclusions qui
découlent de toutes les interventions faites à ce Congrès, et surtout, de mettre en
évidence les efforts que devront mener les responsables des Centres d'Informations,
et les recommandations qu'ils pourraient faire auprès de leurs instances
hiérarchiques ou des Ministères chargés de définir une politique documentaire, dans
le cas bien sûr, où une politique documentaire nationale existe dans un pays. Si ce
n'est pas le cas, il faut tenter par des actions concertées, de faire définir par un
Ministère ou un organisme officiel, une politique documentaire nationale, basée sur
les recommandations issues de ce Congrès.

Très certainement, il s'agit là d'un projet ambitieux, mais si tous les responsables
des Centres d'Informations souhaitent pouvoir répondre favorablement aux désirs et
aux besoins des utilisateurs finaux (lesquels attachent de plus en plus d'importance
à la pertinence, et surtout, à la réduction des délais dans la fourniture de
l'information dans leur langue maternelle), il s'avère indispensable que tous les
moyens (logistique, humain, financier, de formation) soient développés et mis à leur
disposition, afin de leur permettre d'utiliser pleinement et à meilleur profit, les
nouvelles technologies qui seront utilisées à court terme dans la recherche et le
transfert de l'information.

Il est évident que toutes les recommandations ne peuvent être retenues et appliquées
dans tous les pays, ou simplement dans tous les Centres d'Informations, mais ces
recommandations doivent être considérées comme un axe directeur dans la définition
des politiques documentaires de chaque pays ou de chaque Centre d'Informations. Nous
vivons maintenant dans un monde qui a besoin d'information en permanence, et cela
implique des méthodologies et des moyens harmonisés reconnus et utilisés par chacun
d'entre nous. Ce n'est que dans ces conditions que nous pourrons combler les lacunes
dans le domaine de la communication.


En se basant sur ce schéma de principe et sur les exposés présentés au cours de ce
Congrès, les principales recommandations peuvent se répartir sur les 5 points
suivants :

1) Moyens logistiques

2) Standardisation

3) Intelligence artificielle

4) Aide aux utilisateurs / fournisseurs d'informations

5) Base de connaissances


3. RECOMMANDATIONS
3.1 Appui logistique
Organisationnel

création d'un organe de coordination responsable de la
définition d'une politique de l'information et de sa
mise en application.

définition et organisation d'un réseau d'information

Réseaux de télécommunications

réalisation et mise en oeuvre de réseaux RNIS

réalisation et mise en oeuvre de la solution fibres
optiques

préparation de différents types de réseaux de
télécommunications : centralisés, décentralisés et à
connexions multiples.

Systèmes informatiques

donner la priorité à un système d'archivage électronique de haute performance et
convivial connecté à l'environnement informatique existant avec ses applications
(dans la plupart des cas des micro-ordinateurs peuvent être utilisés comme
terminaux reliés aux grands serveurs).

3.2 Standardisation

utilisation des normes internationales pour l'acquisition de textes et figures
dans les systèmes informatiques (du rédacteur à l'éditeur de l'information).

harmonisation de la procédure d'accès aux divers centres et services
d'information, en consultation avec les concepteurs de systèmes et les
utilisateurs.

3.3 Intelligence artificielle

étude et réalisation de systèmes automatiques d'indexation ou d'analyse
(extraction de concepts).

étude et réalisation d'outils de connexion pour améliorer l'accès aux centres et
services et à leurs bases de données par des moyens intelligents, multilingues
et conviviaux.

3.4 Assistance aux fournisseurs d'information et à leurs utilisateurs

réaliser des systèmes de rédaction de texte assistés par ordinateur (pour
traiter les problèmes tels que textes mal rédigés, phrases ambiguës, contrôle
orthographique, etc.).

créer des banques de données de terminologie multilingue avec des contrôles et
validations exercés par les pays utilisateurs.

réaliser des systèmes de traduction assistée par ordinateur (TAO).

améliorer les postes de travail de TAO.

assurer la formation nécessaire des utilisateurs potentiels des nouvelles
technologies de l'information.

3.5 Base de connaissances

organiser et structurer la sélection et l'acquisition de l'information passée,
présente et future à archiver dans une banque de données nationale (savoir
faire/matière grise d'un pays) tout en protégeant les caractéristiques de
confidentialité de l'information mise en circulation.

définir la structure et le contenu des bases de données.


4. CONCLUSION

Voici très schématiquement les recommandations principales que nous pourrions
transmettre à nos autorités compétentes. Il nous paraît difficile de préciser un
ordre d'urgence dans les recommandations présentées ci-avant. Mais si l'on veut
combler les lacunes dans le domaine de la communication de l'information sous toutes
ses formes, il est évident que toutes ces recommandations devront être développées
et utilisées très rapidement.

Cependant, lorsque l'on examine la liste des recommandations, on peut constater que
des actions sont déjà entreprises pour l'ensemble des sujets traités, avec des plans
de financement, qui devraient permettre dans la majorité des cas, de finaliser et de
commercialiser les projets étudiés par les concepteurs de systèmes.

Alors pour quelles raisons devons-nous établir des recommandations ?

1) Parce que ce sont les utilisateurs qui doivent faire connaître leurs besoins et
être associés aux concepteurs, à l'étude et au développement des systèmes dont
ils ont besoin pour l'accomplissement de leurs tâches.

2) Tous les systèmes développés doivent pouvoir s'intégrer dans l'environnement
bureautique existant, et surtout, être compatibles entre eux et s'intégrer dans
un réseau général d'informations. D'où l'importance d'une coordination des
travaux par un organisme officiel.

3) Pour convaincre nos autorités de la nécessité de créer un organisme officiel
coordonnateur, chargé de définir une politique documentaire et d'en assumer
sa mise en application.

Voici donc très brièvement quelques raisons qui nous poussent à établir des
recommandations que nous pourrions adresser à nos autorités et les actions à
entreprendre, en veillant plus particulièrement à la formation du personnel qui
sera amené à utiliser toutes ces nouvelles technologies.